RAG chatbot: quote guide
How Eidos Global scopes, sizes and prices a Retrieval-Augmented Generation (RAG) chatbot for a client who wants to make their internal knowledge (SOPs, policies, HR manuals, training material, supplier contracts, etc.) conversationally searchable.
This document is the engineering view. The matching client-facing questionnaire is at rag-chatbot-client-questionnaire.md; the internal pre-quote checklist is at rag-chatbot-internal-checklist.md.
Currency convention: all numbers in this document are illustrative placeholders shown as [GBP n,nnn] so the team can plug in current rate cards without the guide going stale. Replace before sending to a client.
1. What a RAG chatbot actually is (business framing)
A normal Large Language Model (LLM) is trained on public data up to a cutoff date. It does not know the client's internal documents and, if asked about them, will confidently invent answers.
A RAG chatbot fixes this by combining two things:
- A search index over the client's documents. Each document is broken into small passages and stored in a "vector database" — a search engine that understands meaning, not just keywords.
- An LLM that is forced to answer using only the passages found. When a user asks a question, the system first retrieves the most relevant passages from the index, then asks the LLM "answer this question using only this material, and cite which document each sentence came from".
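The retrieve-then-answer flow described above can be sketched in a few lines. This is a minimal illustration, not our production engine: bag-of-words cosine similarity stands in for a real vector database, and the passages, file names, and function names are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical passages; a real index would hold chunks of the client's documents.
PASSAGES = [
    {"doc": "hr-leave-policy.md", "text": "Staff accrue 25 days of annual leave per year."},
    {"doc": "it-password-sop.md", "text": "Passwords must be rotated every 90 days."},
    {"doc": "expenses-policy.md", "text": "Expense claims need a receipt and manager approval."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, passages, top_k=2):
    """Return up to top_k passages with a non-zero similarity to the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(p["text"]))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]


def build_prompt(question, hits):
    """Wrap the retrieved passages in the 'answer only from this material' instruction."""
    sources = "\n".join(f"[{p['doc']}] {p['text']}" for p in hits)
    return (
        "Answer this question using only this material, "
        "and cite which document each sentence came from.\n\n"
        f"Material:\n{sources}\n\nQuestion: {question}"
    )


question = "How many days of annual leave do staff get?"
hits = retrieve(question, PASSAGES)
prompt = build_prompt(question, hits)
```

The resulting `prompt` is what would be sent to whichever LLM the client chooses; the model call itself is left out because it depends on the provider.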
The business outcomes the client is buying:
- Staff get correct answers from policy / SOP material in seconds instead of hunting through SharePoint or PDFs.
- Answers carry citations, so anything the chatbot says can be traced back to a paragraph in a source document. This is what makes the system defensible for HR, compliance, and audit use.
- New or updated documents flow into the index automatically, so the knowledge base does not drift.
2. The seven decisions that drive cost
Every RAG quote comes down to seven decisions. The client questionnaire asks these in plain English; here they are in engineering terms.
| # | Decision | Cost driver |
|---|---|---|
| 1 | LLM choice — frontier API, cloud-hosted open model, on-prem open model, or fine-tuned small model | Monthly token spend, GPU rental or hardware capex, data-residency posture |
| 2 | Hosting & integration surface — embed in existing intranet, dedicated site, Microsoft Teams app, custom-built UI, or low-code (n8n / Copilot Studio / Botpress) | One-off build effort, ongoing platform fees, SSO integration effort |
| 3 | Users, identity & access control — total user count, peak concurrency, SSO provider, per-document access rules | LLM throughput tier, auth integration effort, complexity of permission-aware retrieval |
| 4 | Vector database — managed (Pinecone, Weaviate Cloud, Azure AI Search, OpenAI Vector Store) vs self-hosted (pgvector, Qdrant, Milvus) | Monthly DB cost vs server cost + ops effort |
| 5 | Corpus shape — how many documents, what formats, how often they change, who owns them | Ingestion pipeline complexity, embedding cost, re-index frequency |
| 6 | Guardrails, security & compliance — PII handling, prompt-injection defence, content filters, audit logging, data residency, certifications expected | Engineering effort, choice of LLM provider, choice of hosting region |
| 7 | Operating model — who runs it after go-live, response SLA, change cadence, on-call expectations | Monthly managed-service fee or knowledge-transfer effort |
The rest of this guide expands each decision and shows how it lands in the cost model.
3. Decision 1: LLM choice
Four credible options. Pick one as the default and price one alternative for comparison.
3a. Frontier API (Claude, GPT, Gemini, etc.)
- Pros: highest quality answers, no infrastructure, fastest to ship, best multilingual coverage, vendor handles upgrades.
- Cons: data leaves the client's perimeter (mitigated by enterprise agreements with zero-retention terms, but a hard "no" for some industries), per-token cost scales linearly with usage, vendor lock-in.
- Typical cost shape: [GBP 0.001–0.02] per user question depending on prompt size and model tier. A 500-staff company averaging 6 questions per user per working day lands around [GBP 400–1,800] per month in LLM tokens alone.
- When to recommend: the client has no hard data-residency rule, wants to ship in weeks not months, and the document corpus is normal business material (not regulated medical/defence content).
3b. Cloud-hosted open model (Llama, Mistral, Qwen on a GPU VM or managed inference)
- Pros: predictable monthly cost, model weights stay in a region the client chooses, no per-token meter, can be air-gapped to a single VPC.
- Cons: quality gap vs frontier is real (closing but not closed), GPU hours billed even when idle unless using a serverless inference service, ops burden falls on us or the client.
- Typical hardware floor for a 7-13B parameter model serving up to ~50 concurrent users: 1x A10 / L4 / A100-40GB class GPU, [GBP 600–2,400] per month rental depending on region and provider. Larger 70B-class models double or quadruple this.
- When to recommend: the client has a soft data-residency preference, expects sustained heavy usage where token billing would dominate, or wants the option to move on-prem later without re-architecting.
3c. On-prem hosted open model
- Pros: data never leaves the client's network. Strongest answer for regulators, legal, HR-sensitive corpora.
- Cons: real capex. Hardware lead times. Client needs a rack, power, cooling, and an ops contact. Model upgrades are a project, not a background event.
- Typical hardware floor:
    - Small (≤25 concurrent users, 7-13B model): 1x workstation-class GPU server, ~[GBP 8,000–15,000] capex, plus UPS + network.
    - Medium (≤100 concurrent, 13-34B model): 1x dual-GPU server (e.g. 2x L40S or 2x A100), ~[GBP 25,000–45,000] capex.
    - Large (≤500 concurrent, 70B+ model): small cluster, 4-8 GPUs, [GBP 80,000–250,000] capex plus a 3-year support contract.
- When to recommend: the corpus contains material that contractually cannot leave premises, or the client has an existing on-prem AI programme this needs to slot into.
3d. Smaller fine-tuned model
A 3-7B model fine-tuned on the client's own SOPs. Useful when the language is very domain-specific (legal, clinical, technical manuals) and the questions are narrow. Adds 2-6 weeks of data preparation and training to the project, and requires the client to commit to a retraining cadence (typically every 3-6 months) as documents evolve.
4. Decision 2: Hosting and integration surface
Where does the user actually talk to the chatbot? Five options, listed roughly in order of build effort; the low-code route comes last because it suits pilots rather than production builds.
- Microsoft Teams app. Lowest friction for staff who already live in Teams. Uses the same SSO they already have. Build effort [5–10 days] on top of the core RAG engine.
- Embedded widget on existing intranet / SharePoint. A JavaScript snippet drops a chat bubble onto an existing page. Build effort [3–7 days] plus whatever the client's intranet team needs to approve the embed.
- Dedicated internal website (e.g. ai.client.com, behind SSO). Cleanest UX, full control over branding and history. Build effort [10–20 days] including auth, conversation history, admin pages.
- Custom-built chatbot framework (Next.js + our own UI library). Best when the client wants distinctive UX or features that off-the-shelf tools cannot offer (multi-document workspace, side-by-side citation panel, role-specific views). Build effort [25–40 days].
- Low-code platform (n8n, Botpress, Copilot Studio, Voiceflow). Faster initial demo, but ceilings appear quickly when permissioning, custom retrieval logic, or branded UI are needed. Useful for pilots and PoCs; we recommend migrating off it for production at any non-trivial scale.
Eidos Global default stack for production builds:
- Frontend: Next.js (or Teams app where appropriate)
- Orchestration: n8n for the document-ingestion side (fits our existing infra), direct API calls for the user-facing retrieval path (latency-sensitive)
- Auth: Authentik OIDC, federated to the client's identity provider
5. Decision 3: Users, identity, access control
Three numbers we always need:
- Total licensed user population — drives the auth licence cost and the monthly token budget envelope.
- Expected daily active users — typically 30-60% of the licensed population for an internal knowledge tool.
- Peak concurrent users — drives LLM throughput tier and self-hosted GPU sizing. Internal tools usually see a 9am Monday spike roughly 5-10x the daily average.
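To turn these three numbers into a sizing input, a rough peak-rate estimate is often enough for a first quote. The sketch below is an assumption-laden illustration: it spreads questions evenly over an assumed 8-hour working day, borrows 6 questions per active user per day from the worked example in the cost model, and estimates question rate rather than true concurrent sessions. The function name and defaults are hypothetical.

```python
def peak_questions_per_minute(daily_active_users, questions_per_user_per_day,
                              working_hours=8.0, spike_multiplier=5.0):
    """Average questions per minute over the working day, scaled by a spike factor."""
    average_per_minute = (daily_active_users * questions_per_user_per_day) / (working_hours * 60)
    return average_per_minute * spike_multiplier


licensed_users = 500
daily_active = licensed_users * 0.6  # top of the 30-60% band quoted above

# The 5-10x spike range from the text gives a low and a high estimate.
low = peak_questions_per_minute(daily_active, 6, spike_multiplier=5)
high = peak_questions_per_minute(daily_active, 6, spike_multiplier=10)
```

The resulting band is a starting point for choosing an LLM throughput tier or GPU size; measured traffic from a pilot should replace it as soon as it exists.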
Plus three access-control questions:
- SSO provider — Azure AD / Entra is the common case and 1-2 days of work; anything else (Okta, Google Workspace, custom SAML) is 3-5 days.
- Document-level permissions — is the policy "everyone who can log in can see every document", or do specific document sets belong to specific roles (e.g. only HR sees disciplinary procedure)? The latter adds 5-10 days of retrieval-layer work and is the single most commonly under-scoped requirement in RAG projects.
- Conversation retention — how long are user chat histories kept, and who is allowed to see them? Has direct implications for the data-protection posture.
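Document-level permissions are easier to scope once the retrieval-layer change is visible. The sketch below shows one possible shape, assuming each indexed passage carries the set of roles allowed to see its source document; the `Passage` fields, role names, and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc: str
    text: str
    allowed_roles: frozenset  # roles permitted to see the source document
    score: float              # relevance score from the retrieval step


def visible_passages(passages, user_roles, top_k=3):
    """Drop passages the user may not see, then keep the top_k by score."""
    roles = set(user_roles)
    permitted = [p for p in passages if p.allowed_roles & roles]
    return sorted(permitted, key=lambda p: p.score, reverse=True)[:top_k]


candidates = [
    Passage("disciplinary-procedure.md", "Text of the disciplinary procedure.", frozenset({"hr"}), 0.91),
    Passage("annual-leave-policy.md", "Text of the leave policy.", frozenset({"hr", "staff"}), 0.84),
    Passage("expenses-policy.md", "Text of the expenses policy.", frozenset({"staff"}), 0.40),
]

staff_view = visible_passages(candidates, {"staff"})
hr_view = visible_passages(candidates, {"hr"})
```

Filtering happens before anything reaches the LLM, so restricted text never enters the prompt. Much of the extra effort in a real build goes into keeping `allowed_roles` in sync with the source system's permissions, which this sketch does not cover.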
6. Decision 4: Vector database
The vector database holds the searchable form of every document passage. There are two viable shapes.
| | Managed cloud (Pinecone, Weaviate Cloud, Azure AI Search) | Self-hosted (pgvector on Postgres, Qdrant, Milvus) |
|---|---|---|
| Monthly cost | [GBP 80–800] depending on tier and storage | Compute only (~[GBP 30–120] of a VM) |
| Ops burden | None | Backups, version upgrades, capacity planning |
| Time to ship | Hours | 2-3 days extra |
| Good fit | Smaller corpora, fastest go-live, client has no infra preference | Anything we already host for the client, or strict data-residency |
For an Eidos-hosted client where we already run Postgres, pgvector is the default and folds into existing backup and DR processes. For a client who only wants us to build, not run, managed Pinecone or Azure AI Search is the recommendation.
7. Decision 5: Corpus shape
Six questions, each with a cost consequence:
- How many documents today? Below ~5,000 documents the ingestion is trivial. 5,000-50,000 needs a proper indexing pipeline. Above 50,000 we need to discuss tiered retrieval (cheap first-pass filter, then semantic ranking).
- Total raw size? Storage cost is negligible; embedding cost is not. A one-off embedding of 100k pages on a frontier provider is roughly [GBP 80–200]; on a self-hosted embedding model there is no per-token charge, but the job takes 8-24 hours of GPU time.
- What formats? Plain text, Markdown, Word, PDF (text), PDF (scanned), PowerPoint, HTML, Confluence, SharePoint, intranet crawl, Notion, video transcripts. Each new format adds 1-3 days of parser work. Scanned PDFs need OCR and quality drops measurably.
- How often do documents change? Daily, weekly, monthly, ad-hoc. Drives whether ingestion is a scheduled job or event-driven.
- Where do the documents live today? A single SharePoint site is easy. Documents spread across SharePoint + a file share + a few people's OneDrives means the first phase of the project is actually a content-consolidation exercise.
- Who decides a document is "approved" to be in the chatbot's knowledge? Drives whether we need an approval workflow before ingestion or a flat "everything in this folder is in scope" rule.
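The tiered retrieval mentioned above for corpora over 50,000 documents (cheap first-pass filter, then semantic ranking) might look roughly like this. It is a toy sketch: keyword overlap stands in for the cheap filter, and the two-dimensional vectors are hand-written placeholders for real embeddings. All names and data are hypothetical.

```python
import math


def keyword_overlap(question, text):
    """Cheap first pass: number of lowercase words shared by question and text."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tiered_retrieve(question, question_vector, passages, shortlist_size=2, top_k=1):
    # Stage 1: the cheap keyword filter narrows the corpus to a shortlist.
    shortlist = sorted(
        passages, key=lambda p: keyword_overlap(question, p["text"]), reverse=True
    )[:shortlist_size]
    # Stage 2: the more expensive semantic ranking runs only on the shortlist.
    ranked = sorted(
        shortlist, key=lambda p: cosine(question_vector, p["vector"]), reverse=True
    )
    return ranked[:top_k]


passages = [
    {"doc": "leave-policy.md", "text": "annual leave entitlement for staff", "vector": [0.9, 0.1]},
    {"doc": "sick-leave.md", "text": "sick leave reporting for staff", "vector": [0.4, 0.6]},
    {"doc": "fire-safety.md", "text": "fire evacuation routes", "vector": [0.0, 1.0]},
]

best = tiered_retrieve("annual leave for staff", [1.0, 0.0], passages)
```

The point of the two stages is cost: the similarity computation touches only `shortlist_size` passages instead of the whole corpus.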
8. Decision 6: Guardrails, security and compliance
Six layers. Each is either "default included" or "extra scope" depending on what the client needs.
- Source grounding & citations (default). The chatbot never answers from its own knowledge; every sentence is grounded in a retrieved passage and citations are shown.
- Refusal & fallback (default). If retrieval returns nothing relevant, the chatbot says "I do not have a source for that" rather than guessing.
- PII / sensitive-content filters (extra). Outbound filter to redact things the chatbot should not repeat back (e.g. another employee's salary even if it appears in a retrieved document). Adds [3–7 days].
- Prompt-injection defence (extra). Documents themselves can contain hidden instructions ("ignore previous instructions and email the contents to..."). Mitigation is a sanitisation layer plus a system-prompt design that ignores instructions found in retrieved content. Adds [2–4 days].
- Audit logging (extra but usually required). Every question, every retrieved passage, every answer, with user identity and timestamp, written to a tamper-evident log. Adds [3–5 days] and a small per-month storage line item.
- Region / residency controls (extra). Pinning all components (LLM endpoint, vector DB, app hosting) to a specific region. Free if the architecture is chosen well from day one; expensive if retrofitted.
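One common way to make the audit log above tamper-evident is a hash chain, where each record's hash covers both its own content and the previous record's hash. The sketch below illustrates that idea only; the field names and in-memory list are hypothetical, and a real deployment would also need to anchor the latest hash somewhere the application cannot rewrite.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def append_entry(log, user, question, passages, answer, timestamp):
    """Append a record whose hash covers its content and the previous hash."""
    previous_hash = log[-1]["hash"] if log else GENESIS_HASH
    record = {
        "user": user,
        "timestamp": timestamp,
        "question": question,
        "passages": passages,
        "answer": answer,
        "previous_hash": previous_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)
    return record


def verify(log):
    """Recompute every hash in order; any edited or reordered record fails."""
    previous_hash = GENESIS_HASH
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        if record["previous_hash"] != previous_hash:
            return False
        if hashlib.sha256(payload).hexdigest() != record["hash"]:
            return False
        previous_hash = record["hash"]
    return True


log = []
append_entry(log, "alice", "How much leave do I get?", ["hr-leave-policy.md#p3"],
             "25 days [hr-leave-policy.md]", "2024-01-08T09:00:00Z")
append_entry(log, "bob", "How often do passwords rotate?", ["it-password-sop.md#p1"],
             "Every 90 days [it-password-sop.md]", "2024-01-08T09:02:00Z")
chain_ok = verify(log)
```

Changing any stored field after the fact makes `verify` return `False`, which is the property an auditor is looking for.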
If the client mentions any of GDPR DPIA, ISO 27001, SOC 2, HIPAA, PCI-DSS, or a regulator's specific guidance, flag it immediately: this materially changes the architecture and adds weeks of evidence work that must be scoped separately.
9. Decision 7: Operating model
After go-live, three things happen on a steady cadence:
- Document ingestion. New, updated, and deleted documents flow into the index.
- Quality watch. Someone looks at the questions users ask and the answers they got, flags bad answers, and either improves the source document or tunes the retrieval.
- Platform maintenance. LLM provider updates, library upgrades, security patches, vector DB upgrades, certificate renewals, identity-provider changes.
The client picks one of three operating models:
- Eidos-managed (recommended for under-200-staff clients). We run everything, including a named engineer responding to issues within an agreed SLA. Monthly fee, includes a fixed quota of change requests.
- Co-managed. Client owns content and first-line user support, we own platform and second-line. Lower monthly fee, requires the client to nominate an internal product owner.
- Hand-off. We build, document, hand over, and bill on an ad-hoc basis afterwards. Lowest monthly cost, highest risk that the system decays.
Default SLA tiers we should be willing to quote:
| Tier | Response | Resolution target | Hours |
|---|---|---|---|
| Bronze | next business day | 5 business days | 9-5 UK |
| Silver | 4 business hours | 2 business days | 9-5 UK |
| Gold | 1 business hour | next business day | 8-6 UK, on-call weekends |
10. The cost model
Two distinct lines on every quote. Keep them separated; clients are used to seeing them this way.
A. One-off implementation cost
Built from labour days * blended day rate, plus any one-off licences and content-migration effort.
| Phase | Description | Days (small) | Days (medium) | Days (large) |
|---|---|---|---|---|
| 0. Discovery & design | Workshops, architecture sign-off, success metrics | 3 | 5 | 10 |
| 1. Document pipeline | Connectors, parsers, OCR if needed, embedding job | 5 | 10 | 20 |
| 2. RAG engine | Retrieval, ranking, prompt design, citation rendering | 5 | 10 | 15 |
| 3. UI / integration | Teams app / web UI / SSO wiring | 5 | 12 | 25 |
| 4. Guardrails & audit | Filters, audit log, admin views | 3 | 7 | 12 |
| 5. UAT & tuning | Prompt tuning, retrieval tuning, dogfooding with client | 4 | 8 | 14 |
| 6. Handover & training | Docs, runbooks, train-the-trainer sessions | 2 | 3 | 5 |
| Total person-days | | 27 | 55 | 101 |
Sizing tiers, expressed in business terms:
- Small: <250 staff, <5,000 documents, single department or use-case (e.g. HR self-service). Wall-clock 6-8 weeks with 1.5 FTE.
- Medium: 250-2,000 staff, 5,000-25,000 documents, two or three use-cases (HR, IT, ops SOPs). Wall-clock 10-14 weeks with 2 FTE.
- Large: 2,000+ staff, 25,000+ documents, organisation-wide, multiple permission domains, regulator in the room. Wall-clock 16-24 weeks with 3+ FTE.
At a blended day rate of [GBP rate], implementation prices land in
the bands [small range], [medium range], [large range].
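The "labour days * blended day rate" calculation can be kept as a small script so the totals stay in step with the phase table. The phase figures below are copied from the table above; the day rate of 100 is a hypothetical placeholder with no relation to any real rate card.

```python
# Person-days per phase for the (small, medium, large) tiers, from the phase table.
PHASE_DAYS = {
    "0. Discovery & design": (3, 5, 10),
    "1. Document pipeline": (5, 10, 20),
    "2. RAG engine": (5, 10, 15),
    "3. UI / integration": (5, 12, 25),
    "4. Guardrails & audit": (3, 7, 12),
    "5. UAT & tuning": (4, 8, 14),
    "6. Handover & training": (2, 3, 5),
}
TIERS = ("small", "medium", "large")


def total_days():
    """Sum the person-days of every phase for each sizing tier."""
    return {
        tier: sum(days[index] for days in PHASE_DAYS.values())
        for index, tier in enumerate(TIERS)
    }


def implementation_price(day_rate):
    """Labour days multiplied by the blended day rate, per tier."""
    return {tier: days * day_rate for tier, days in total_days().items()}


totals = total_days()
prices = implementation_price(day_rate=100)  # 100 is a placeholder, not a real rate
```

One-off licences and content-migration effort are not included here and would be added as separate lines.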
B. Monthly run cost
Built as a sum of fixed lines and variable lines.
Fixed monthly lines (charged whether or not the chatbot is used):
- Vector DB hosting (managed) or VM hosting (self-hosted)
- App hosting (Next.js / Teams app)
- Identity provider integration licence (if any)
- Audit log storage
- Eidos managed-service fee (if Bronze/Silver/Gold chosen)
Variable monthly lines (scale with usage):
- LLM token spend = (daily active users) * (questions per user per day) * (avg tokens per question / 1000) * (price per 1k tokens) * 22 working days
- Embedding spend = (new documents per month) * (avg pages) * (tokens per page / 1000) * (price per 1k tokens), which stays small unless the corpus churns hard
- Re-embedding for updated documents = same calculation, applied to the changed-document count
A worked example to use in the questionnaire:
500 users, 60% daily active, 6 questions per active user per day, 800 tokens per question round-trip, frontier model at [GBP 0.005] per 1k tokens, 22 working days: 500 * 0.6 * 6 * 800 * 0.005 / 1000 * 22 ≈ [GBP 158] per month in LLM tokens. Add [GBP 200–600] of fixed lines and the steady-state opex lands around [GBP 350–800] per month before the managed-service fee.
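The same arithmetic as a reusable sketch, using the worked example's placeholder inputs. The function names are hypothetical; note that token counts are divided by 1,000 because prices are quoted per 1k tokens.

```python
def monthly_llm_spend(users, daily_active_share, questions_per_user_per_day,
                      tokens_per_question, price_per_1k_tokens, working_days=22):
    """Variable LLM token cost per month."""
    daily_tokens = users * daily_active_share * questions_per_user_per_day * tokens_per_question
    return daily_tokens / 1000 * price_per_1k_tokens * working_days


def monthly_embedding_spend(new_documents, avg_pages, tokens_per_page, price_per_1k_tokens):
    """Variable embedding cost per month for new or changed documents."""
    return new_documents * avg_pages * tokens_per_page / 1000 * price_per_1k_tokens


# Worked example: 500 users, 60% daily active, 6 questions, 800 tokens, 0.005 per 1k tokens.
llm = monthly_llm_spend(500, 0.6, 6, 800, 0.005)

# Adding the placeholder fixed-line band of 200-600 gives the steady-state opex band.
opex_low, opex_high = llm + 200, llm + 600
```

With these inputs `llm` comes to about 158.4, and the opex band to roughly 358 to 758, which matches the rounded figures in the worked example.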
Keep this calculation in the quote even when the number is small — clients invariably ask "and what does it cost to keep running" and a defensible answer prevents anxiety later.
C. Things that are easy to forget on a first quote
- Initial document ingestion is a project, not a task. Anywhere from 3 to 20 days depending on corpus shape.
- Content cleanup is the client's job but it lands on us. Budget for 2-5 workshops to help the client decide what is in scope, what is duplicated, and what is obsolete.
- Evaluation set. Before go-live we need 50-200 known-good question/answer pairs from the client. Budget 2-3 days of client SME time. If the client cannot produce these, the project has a quality problem we cannot solve for them.
- The second wave. Once a chatbot answers HR questions well, someone will ask "can it also do IT and finance?". Quote phase 1 honestly and refer to phase 2 explicitly so the conversation is expected, not a surprise.
11. Ad-hoc companion services
A RAG chatbot rarely lands cleanly on top of a tidy estate. In practice three companion services come up on most quotes — sometimes as prerequisites the client did not realise they needed, sometimes as obvious upsell opportunities once we are already on site. Quote each as a self-contained line item so the client can take or leave it without re-pricing the core build.
11a. Single Sign-On (SSO) rollout
What it is. A single login (typically Microsoft Entra ID, Google Workspace, or Okta) that gates every internal application, instead of each app having its own username and password.
Why it comes up. The chatbot will be the new shiny app behind SSO. If the client does not already have SSO in front of the rest of their internal apps, the security gap becomes visible the moment we turn the chatbot on. Many clients also realise during discovery that their existing "SSO" is in fact just Entra logins on Microsoft 365 and nothing else.
Eidos Global default stack. Authentik as the identity broker (federated to the client's Entra / Google / Okta), OIDC or SAML downstream to each app. Same pattern we run in our own estate.
Cost shape:
- Discovery & design: [2–4 days]. Inventory of existing apps; decide which can take OIDC, which need a proxy in front (see 11b), which need SAML, and which are a write-off.
- Authentik deployment: [2–3 days] if hosted on existing client infrastructure; [3–5 days] if we stand up a new VM and SSL.
- Federation to the client's IdP (Entra / Google / Okta): [1–2 days].
- Per-application integration: [0.5–3 days] each, depending on whether the app supports OIDC natively, needs SAML, or has to be proxied (handover to 11b). Quote per-app, not as a bundle.
- User communications & rollout: [2–4 days] of change-management support, because staff hate login changes more than any other IT change.
- Monthly run cost: small ([GBP 50–200]) if hosted alongside existing infra; the value is in reduced password reset tickets and audit-clean access reviews, not in software licence savings.
Tier guidance:
- Small (5-10 apps): 8-12 days total, fits in 3-4 weeks wall-clock.
- Medium (10-30 apps): 18-30 days total, 6-10 weeks.
- Large (30+ apps, mixed SAML/OIDC/legacy): 40+ days, 3-4 months, becomes a programme, not a project.
11b. Forward-auth proxy for legacy apps
What it is. A reverse proxy (Caddy, Traefik, Nginx with
auth_request) sitting in front of internal apps that do not support
modern login. The proxy demands an SSO login first, then passes the
request through to the legacy app with the user's identity already
established.
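The decision the proxy makes on each request can be illustrated with a small simulation. This is not Caddy or Authentik configuration; it is a plain-Python sketch of the logic, and the cookie name, header name, and login URL are hypothetical.

```python
def forward_auth(request, sessions, login_url="https://sso.example.internal/login"):
    """Decide what the proxy does with a request before the legacy app sees it."""
    session_id = request.get("cookies", {}).get("sso_session")
    user = sessions.get(session_id)
    if user is None:
        # No valid SSO session: send the browser to the login page first.
        return {"status": 302, "location": f"{login_url}?next={request['path']}"}
    # Valid session: pass the request upstream with the identity established.
    # Overwriting the header also discards any value the client tried to supply.
    headers = dict(request.get("headers", {}))
    headers["X-Forwarded-User"] = user
    return {"status": 200, "path": request["path"], "upstream_headers": headers}


sessions = {"abc123": "alice"}  # session id -> authenticated user

anonymous = forward_auth({"path": "/reports", "cookies": {}}, sessions)
known = forward_auth(
    {
        "path": "/reports",
        "cookies": {"sso_session": "abc123"},
        "headers": {"X-Forwarded-User": "mallory"},
    },
    sessions,
)
```

The legacy app only ever sees requests that carry an identity set by the proxy, which is why it needs no login code of its own.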
Why it comes up. Every client has at least one of: an internal phpMyAdmin, an old reporting tool, a vendor admin panel, a static internal site, an APEX app, a Grafana behind basic-auth. Putting these behind forward-auth is how SSO becomes universal rather than "everything except the bits that matter".
Eidos Global default stack. Caddy + Authentik forward-auth (the same pattern we use for our own admin tooling).
Cost shape:
- Proxy deployment: [1–2 days] for the first app (sets up the pattern); [0.5–1 day] per subsequent app.
- Per-app testing & hardening: [0.5–2 days] each, since some apps break in entertaining ways when their headers change.
- Public-DNS / certificate wiring: included where the proxy is internet-facing; trivial when internal-only.
- Monthly run cost: negligible above what SSO already costs; proxy itself is free software and runs on tiny VMs.
Quote as: a per-app unit price, with a small discount for batches of 5+. Clients understand "GBP X per legacy app brought behind SSO" better than a lump sum.
11c. MkDocs as the SOP and policy home
What it is. A documentation website built from Markdown files in a Git repository. The same setup this very page is rendered with — Material for MkDocs, search, versioning, role-based access if needed.
Why it comes up. A RAG chatbot is only as good as the documents it points at. Almost every client who arrives wanting a chatbot has SOPs spread across SharePoint folders, Word documents with conflicting versions, PDFs printed from a system nobody runs any more, and a few critical procedures that live only in one person's head. We can build the chatbot on top of that mess, but the chatbot is then a polish on a foundation problem.
Offering MkDocs alongside the chatbot reframes the conversation: first we give the documents a single, version-controlled, searchable home, then the chatbot reads from that home. The chatbot becomes the front door; MkDocs is the building behind it. The same Markdown files serve both human readers (via the site) and the chatbot (via ingestion), so there is one source of truth.
Eidos Global default stack. Material for MkDocs, Git-hosted (GitLab or the client's GitHub / Azure DevOps), CI build to a static site behind SSO (looped back to 11a), optional review-and-approve workflow via merge requests.
Cost shape:
- Discovery & information-architecture workshop: [2–4 days] to decide the section structure, the metadata standard (owner, review date, classification), and the approval rule.
- Platform setup: [2–4 days] covering repo, CI pipeline, theme tuned to the client's branding, search, SSO integration, hosting.
- Initial content migration: the largest and most variable line. Bands per 100 documents:
    - Already-clean Word / Markdown: [1–2 days] per 100.
    - PDF (text, well-structured): [2–4 days] per 100.
    - PDF (scanned) or mixed legacy formats: [4–8 days] per 100, plus OCR cost.
    - "We do not know what we have": start with a 5-day audit before quoting the migration.
- Editorial standard & template: [1–2 days] for a one-page style guide and a document template so future authors do not regress to the old habits.
- Train-the-author session(s): [1–2 days] for the people who will write and update SOPs going forward.
- Monthly run cost: very low ([GBP 20–100]) for static hosting, Git, CI minutes. The cost is editorial time on the client side, not infrastructure.
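The per-100-document migration bands above turn into a day range with simple arithmetic. The sketch below uses the placeholder bands from the list; the format keys and the document counts in the example are hypothetical, and OCR cost is not included.

```python
# (low, high) migration days per 100 documents, from the bands listed above.
MIGRATION_DAYS_PER_100 = {
    "clean_word_or_markdown": (1, 2),
    "pdf_text": (2, 4),
    "pdf_scanned_or_legacy": (4, 8),
}


def migration_days(document_counts):
    """Return a (low, high) estimate of migration days for a mix of formats."""
    low = high = 0.0
    for fmt, count in document_counts.items():
        band_low, band_high = MIGRATION_DAYS_PER_100[fmt]
        low += count / 100 * band_low
        high += count / 100 * band_high
    return low, high


estimate = migration_days({
    "clean_word_or_markdown": 300,
    "pdf_text": 200,
    "pdf_scanned_or_legacy": 50,
})
```

A client in the "we do not know what we have" position has no counts to feed in, which is why the 5-day audit comes first.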
Why we should almost always offer this with the chatbot. It is the difference between a six-month win ("the chatbot works") and a two-year win ("our knowledge is actually under control"). It also gives the client an obvious phase-2 even if they only buy the chatbot today.
11d. How these bundle on the quote
Three patterns we should be ready to put on paper:
| Bundle | What's included | When to recommend |
|---|---|---|
| Knowledge foundation | MkDocs + content migration + chatbot on top | Client's documents are scattered and stale — this is the honest sequence |
| Identity foundation | SSO + forward-auth for legacy apps + chatbot behind SSO | Client has tidy documents but a messy auth landscape |
| Full stack | SSO + forward-auth + MkDocs + chatbot | Greenfield internal-IT modernisation — typically a 4-6 month programme |
Price each bundle with a small (5-10%) discount versus the sum of the parts, framed as "engagement efficiency" rather than a percentage off. The discount is real — we save mobilisation overhead — and the framing keeps the individual line items defensible if the client later asks why the chatbot on its own costs what it costs.
12. From document to interactive quote tool
The questionnaire is intentionally written so each answer maps to one of
a small number of pricing variables. When we build the interactive
version (public landing page on eidos-global.com, or internal-only
tool behind SSO), the form should ask only these variables and produce a
range, not a fixed number:
| Input variable | Type | Drives |
|---|---|---|
| Total staff | number | LLM token envelope, infra tier |
| % expected to use daily | percent | LLM token envelope |
| Document count today | bucket (<1k / 1-5k / 5-25k / 25k+) | Ingestion days, vector-DB tier |
| Document formats | multi-select | Parser effort |
| New documents per month | bucket | Variable monthly cost |
| Hosting preference | radio (frontier API / cloud-hosted open / on-prem / "no preference") | Architecture, infra cost |
| Identity provider | dropdown (Entra / Okta / Google / Other) | SSO effort |
| Per-document permissions needed | yes/no | Retrieval effort |
| Integration surface | multi-select (Teams / intranet / dedicated site) | UI effort |
| Compliance regime | multi-select (none / GDPR-DPIA / ISO27001 / sector-specific) | Guardrail and evidence effort |
| Operating model after launch | radio (managed / co-managed / hand-off) | Monthly fee |
| SLA tier | radio (Bronze / Silver / Gold) | Monthly fee |
| Add SSO rollout to other apps? | yes / no / "tell me more" | Triggers section 11a quote line |
| Apps needing forward-auth proxy | bucket (none / 1-5 / 6-15 / 15+) | Per-app multiplier from section 11b |
| Document home today | radio (already in MkDocs or wiki / SharePoint / shared drive / mixed / unknown) | MkDocs migration band from section 11c |
| Estimated documents to migrate to MkDocs | bucket (<100 / 100-500 / 500-2k / 2k+) | Migration-days line in 11c |
Output of the tool: a "from-to" implementation band, a "from-to" monthly band, and a wall-clock range, with a CTA to book a discovery call. Always show the band, never a single number — the cost of getting a single number wrong on a public page is much higher than the cost of showing a range.
13. Related documents
- rag-chatbot-client-questionnaire.md — what to send to the client.
- rag-chatbot-internal-checklist.md — pre-quote review by the Eidos team.
- overview/ai-sovereignty.md — Eidos Global position on AI vendor risk and data residency; reuse the language here when answering "where will our data go".