Bug Reporter v2 — centralized bug-reporting service (design proposal)

Summary — A lightweight, n8n-centric service that turns the one-way Parallax Bug Reporter into a centralized intake + SLA + lifecycle + two-way-comms + reporting platform across all ~13 apps (APEX and non-APEX). Postgres and Jira are the two sources of truth; n8n on workflow.eidos-global.com (Dokploy/E2) does the orchestration; blobs stream through n8n to OCI Object Storage; OpenAI does screening + RCA. This is a proposal to discuss, not yet built. Status: draft, decisions locked 2026-06-02, owner Vishnu Kant.

Context

v1 is a JS widget bolted onto one APEX app, one-way: it raises a Jira ticket via n8n and emails the team, then goes silent — no reporter feedback, no lifecycle, no SLA, the STATUS column is inert. We want bug reporting across the whole estate (APEX + WordPress + Teams Bot + CRMs) with SLAs, AI triage, two-way customer comms, full comms logging, and reporting — without turning it into a full-blown product. The whole design biases to "reuse the estate, lean on Jira + n8n, minimal custom code." First delivery is a like-for-like replacement for the Parallax widget so Parallax gains a fallback before anything else changes.

Guiding principles

  • Two sources of truth: Postgres and Jira. Jira holds ticket status/lifecycle; Postgres (bugsys on E2) holds config, the durable intake buffer, SLA/metrics ledger, and the full comms log. Everything else (n8n state, AI output, OCI) is derived and rebuildable. If any downstream "fancy" piece fails, we can still work tickets manually from Postgres + Jira.
  • Build on the new n8n (workflow.eidos-global.com, Dokploy/E2). Never the legacy O1 instance (SPOF, being retired).
  • Email is dumb transport. All Jira-to/from-email translation lives in n8n; every message is logged.
  • Blobs never persist in Jira/email/n8n — they stream through n8n to OCI Object Storage; everything else carries only a PAR URL + metadata.
  • One canonical normalized intake payload, written to Postgres first, before any processing.
  • Jira Software only — no new licensing. Priority is decided by simple deterministic math over AI + user inputs; we do not adopt Jira Service Management.
  • External identity = email OTP + CAPTCHA, not federated accounts. The same 5-minute email-OTP gate verifies app-down reporters and authenticates the chatbot. CAPTCHA is Altcha (open-source, self-hosted, proof-of-work, no third-party accounts).
  • Secrets live in Vault (vault.448.global), read by n8n at runtime — consistent with the rest of the estate.

Target architecture

flowchart LR
  subgraph Channels
    W[Embedded widget<br/>APEX + non-APEX]
    F[Standalone app-down form<br/>OTP + CAPTCHA]
    E[Inbound email<br/>support@eidos-global.com]
    A[Raw API<br/>Teams Bot / CRM]
  end
  W & F & E & A --> RAW[(intake_raw<br/>Postgres write-first)]
  RAW --> N[n8n /intake<br/>auth + rate-limit + normalise]
  N --> P[(bugsys Postgres<br/>profiles + ledger + comms + otp)]
  N --> S[AI screening<br/>type + severity + dedupe + guide]
  S -->|priority = simple math| J[Jira<br/>per-app bug project]
  S -->|question/guide hit| W
  J --> RCA[AI RCA scan<br/>clone prod branch -> AI-RCA comment]
  J <--> C[Two-way comms<br/>threaded email + comms_audit log]
  N --> O[(OCI Object Storage<br/>blobs stream-through, 2yr lifecycle)]
  J --> L[(SLA ledger in Postgres)]
  L --> R[Scheduled reports<br/>per-org + internal all-up]
  SLA[SLA poller + escalation + auto-close] --> J
  CHAT[n8n AI Agent chat form<br/>OTP + CAPTCHA] --> P & J

Single store: bugsys Postgres on E2

Unified per decision — no n8n Data Tables for this system (kept only for the unrelated careers reporting). One dedicated Postgres container on Dokploy/E2, backed up to OCI with a real restore drill. It holds:

  • Config (Bug Profiles): service, contact, sla_tier, escalation_step, bugtype_mapping, user_guide, report_schedule, profile_audit
  • Durable buffer: intake_raw (every raw payload, written before processing)
  • Lifecycle + metrics: ticket_state / sla_ledger (denormalised, one row per ticket, stamped off Jira events)
  • Comms: comms_threads (threading/idempotency state) + comms_audit (append-only full log)
  • Reliability: dead_letter (failed executions for replay)
  • Auth: otp (issued codes, 5-min expiry, attempt counters)

The nine capabilities

1. Intake and channels (multi-app)

One canonical n8n webhook /intake; a "Normalise" node maps each channel onto a single payload (appKey, reporter, type, title, desc, env, diagnostics, attachment PAR-urls), and the raw payload is written to intake_raw first. Channels:

  • Embedded widget: reuse v1 bug-reporter.js almost verbatim (it already separates generic env capture from APEX-only diagnostics and already supports a pure webhookUrl+X_API_KEY path). Add an appKey option, host once, embed per app.
  • Standalone "app-down" form: same DOM, manual app-picker, independent host/domain (e.g. report.eidos-global.com); gated by email OTP + Altcha so it can't be spammed and the reporter email is verified.
  • Inbound email-to-ticket: dedicated mailbox, alias-per-app routing.
  • Raw API: same endpoint for backends (Teams Bot, CRMs).
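As a rough illustration of the "Normalise" step, the sketch below maps two channels onto one payload type. The field list comes from the text; the raw key names (`appKey`, `attachments`, `to`, `from`, `subject`, `text`), the `alias_to_app` lookup and the default `type="bug"` are assumptions made for the example, not the agreed contract.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class IntakePayload:
    """Canonical payload; the fields follow the list given in the text."""
    app_key: str
    reporter: str
    type: str
    title: str
    desc: str
    env: Dict[str, Any] = field(default_factory=dict)
    diagnostics: Dict[str, Any] = field(default_factory=dict)
    attachment_par_urls: List[str] = field(default_factory=list)


def normalise_widget(raw: Dict[str, Any]) -> IntakePayload:
    """Widget and raw-API callers are assumed to send near-canonical JSON."""
    return IntakePayload(
        app_key=raw["appKey"],
        reporter=raw["reporter"].strip().lower(),
        type=raw.get("type", "bug"),
        title=raw["title"].strip(),
        desc=raw.get("desc", ""),
        env=raw.get("env", {}),
        diagnostics=raw.get("diagnostics", {}),
        attachment_par_urls=list(raw.get("attachments", [])),
    )


def normalise_email(raw: Dict[str, Any], alias_to_app: Dict[str, str]) -> IntakePayload:
    """Inbound mail: the mailbox alias picks the app (alias-per-app routing)."""
    alias = raw["to"].split("@", 1)[0].lower()
    return IntakePayload(
        app_key=alias_to_app[alias],
        reporter=raw["from"].strip().lower(),
        type="bug",  # assumption: screening reclassifies later
        title=raw.get("subject", "").strip() or "(no subject)",
        desc=raw.get("text", ""),
    )
```

Whatever the final shape, keeping one function per channel and one output type means everything downstream of /intake only ever sees `IntakePayload`.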

Auth: per-app appKey + secret (X_API_KEY as v1; HMAC for server-side callers). MVP effort: S–M.
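For the server-side HMAC option, a minimal sketch of signing and verifying a request follows. The signing scheme (SHA-256 over `timestamp + "." + body`) and the 300-second replay window are assumptions for illustration; the proposal only states that HMAC is used for server-side callers.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Hex HMAC-SHA256 over '<timestamp>.<body>' (hypothetical scheme)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[int] = None, max_skew_s: int = 300) -> bool:
    """Reject stale timestamps first, then compare in constant time."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > max_skew_s:
        return False
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Binding the timestamp into the signed message is what lets the receiver refuse replays of an old, otherwise valid request.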

2. Bug Profiles config

One profile per service (app/tenant) in bugsys:

service (service_id PK, display_name, org_name, active,
         channels jsonb, ai_screening, ai_rca, ai_chatbot,
         jira_project_url, jira_project_key, codebase_url, prod_branch,
         autoclose_days default 14,
         business_hours jsonb,   -- per-service; default UK only, optional UK+India
         created_at, updated_at)
contact (contact_id PK, service_id FK, name, email, role, party 'internal|external',
         alert_key_bug, alert_reports, notify_status_change)
sla_tier (service_id FK, priority 'P1|P2|P3', response_mins, resolution_mins,
          clock 'business|calendar', PK(service_id,priority))
escalation_step (service_id FK, step_no, after_mins, contact_id FK, PK(service_id,step_no))
bugtype_mapping (service_id FK, category 'bug|feature|question', jira_issue_type, extra_fields jsonb,
                 PK(service_id,category))
user_guide (service_id FK, topic, url, PK(service_id,topic))
report_schedule (service_id FK, cadence 'weekly|monthly', recipients jsonb, PK(service_id,cadence))
profile_audit (audit_id, table_name, service_id, changed_by, change jsonb, changed_at)  -- generic trigger

Seed SLA defaults (response/resolution, in minutes): P1 240/1440, P2 1440/5760, P3 2880/10080. Business hours default to UK only ({"tz":"Europe/London","days":[1,2,3,4,5],"start":"09:00","end":"17:00"}); a service can override to UK+India. Edits go through SQL migrations plus a small n8n "profile admin" webhook form for the common 80%; a thin admin UI (APEX or NocoDB on the same Postgres) is added only if non-technical owners need self-serve. MVP effort: S (container + schema + audit), M (profile-hydrate query).

3. SLA engine, lifecycle, escalation, auto-close — Jira Software only

  • Priority by simple math: AI screening proposes severity, the reporter supplies urgency + impact; priority = a fixed urgency×impact matrix mapped to P1/P2/P3 (deterministic, explainable, no JSM). Targets come from sla_tier.
  • Compute targets at intake, snapshot onto the Jira ticket (sla_response_due, sla_resolution_due, priority used) so breach detection is a timestamp compare and policy is frozen even if the profile later changes. Mirror the same timestamps into the Postgres ledger.
  • Lifecycle (one shared Jira workflow): New -> Screening -> Triaged (response clock stops) -> In Progress -> Awaiting Customer (resolution clock pauses) -> Resolved (clock stops, auto-close countdown starts) -> Closed; plus Needs Info (internal, does not pause) and Won't Fix (terminal).
  • Breach + escalation: n8n cron poll (5–10 min) over open tickets; rungs from escalation_step (L0 approaching -> nudge assignee; L1 breach -> POC email; L2 -> leadership). Write escalation_level back to Jira so polling is stateless/restart-safe. Email only (Gotify broken, KI-022).
  • Auto-close after autoclose_days with no reporter response: Jira Automation scheduled rule preferred, n8n cron fallback.

MVP effort: M.
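The priority lookup and target snapshot described above might look like the sketch below. The SLA minutes are the seed defaults from the Bug Profiles section. The matrix cells and the high/medium/low scale are placeholders, because the proposal fixes only that the matrix is deterministic, and how the AI-proposed severity feeds in is left out. The business clock assumes timestamps are already in the service's local time with 09:00 to 17:00 Monday to Friday hours.

```python
from datetime import datetime, time, timedelta

# (urgency, impact) -> priority. Placeholder cells, not the agreed matrix.
PRIORITY_MATRIX = {
    ("high", "high"): "P1", ("high", "medium"): "P1", ("medium", "high"): "P1",
    ("high", "low"): "P2", ("medium", "medium"): "P2", ("low", "high"): "P2",
    ("medium", "low"): "P3", ("low", "medium"): "P3", ("low", "low"): "P3",
}
# priority -> (response_mins, resolution_mins), seed defaults from the text.
SLA_DEFAULTS = {"P1": (240, 1440), "P2": (1440, 5760), "P3": (2880, 10080)}


def add_business_minutes(start, minutes, days=(1, 2, 3, 4, 5),
                         open_t=time(9, 0), close_t=time(17, 0)):
    """Advance `start` by `minutes`, counting only time inside business hours."""
    if not days:
        raise ValueError("at least one working day is required")
    current, remaining = start, minutes
    while True:
        day_open = datetime.combine(current.date(), open_t)
        day_close = datetime.combine(current.date(), close_t)
        if current.isoweekday() in days and current < day_close:
            current = max(current, day_open)
            available = int((day_close - current).total_seconds() // 60)
            if remaining <= available:
                return current + timedelta(minutes=remaining)
            remaining -= available
        # Nothing (more) usable today: jump to the next day's opening time.
        current = datetime.combine(current.date() + timedelta(days=1), open_t)


def snapshot_targets(created, urgency, impact, clock="business"):
    """Compute the values that would be frozen onto the ticket at intake."""
    priority = PRIORITY_MATRIX[(urgency, impact)]
    response_mins, resolution_mins = SLA_DEFAULTS[priority]

    def due(mins):
        if clock == "business":
            return add_business_minutes(created, mins)
        return created + timedelta(minutes=mins)

    return {
        "priority": priority,
        "sla_response_due": due(response_mins),
        "sla_resolution_due": due(resolution_mins),
    }
```

With the due timestamps frozen at intake, the breach poller only needs to compare `now` against `sla_response_due` and `sla_resolution_due`; pauses such as Awaiting Customer would need extra handling that this sketch does not cover.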

4. AI initial screening

A single intake-triggered OpenAI call (JSON mode) runs before the Jira node and returns: type (bug/feature/question/not-a-bug), severity, a duplicate check (a cheap Jira JQL text pre-filter feeds candidate summaries into the same call, so no vector DB is needed), and guide deflection (a question matching a user_guide URL is offered the link, possibly skipping Jira). It fails open (AI down -> raise the ticket anyway), sends anything below a confidence floor to human review, and screens only after rate-limit + OTP so abuse can't burn credits (KI-046). MVP effort: S–M.
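The fail-open and confidence-floor rules could be wired roughly as follows. This sketch covers only the routing decision after the model has replied (or failed to). The response keys (`type`, `confidence`, `topic`, `severity`), the 0.7 floor, and the choice to still create a ticket when confidence is low are all assumptions.

```python
import json
from typing import Any, Dict, Optional


def route_screening(raw_response: Optional[str], guides: Dict[str, str],
                    confidence_floor: float = 0.7) -> Dict[str, Any]:
    """Decide what happens next from the screening call's JSON reply."""
    try:
        result = json.loads(raw_response or "")
        category = result["type"]
        confidence = float(result["confidence"])
    except (ValueError, KeyError, TypeError):
        # Fail open: AI down or reply unusable, so raise the ticket unscreened.
        return {"create_ticket": True, "needs_review": False,
                "guide_url": None, "screened": False}

    if confidence < confidence_floor:
        # Below the floor: a human decides; the ticket is still created here.
        return {"create_ticket": True, "needs_review": True,
                "guide_url": None, "screened": True}

    if category == "question":
        url = guides.get(result.get("topic", ""))
        if url:
            # Guide deflection: offer the link instead of opening a ticket.
            return {"create_ticket": False, "needs_review": False,
                    "guide_url": url, "screened": True}

    return {"create_ticket": True, "needs_review": False, "guide_url": None,
            "screened": True, "type": category,
            "severity": result.get("severity")}
```

Treating a malformed reply the same as an outage keeps the failure mode singular: whatever goes wrong with the AI step, the reporter still gets a ticket.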

5. AI codebase RCA scan

Async, best-effort, never blocks intake or Jira creation. After ticket creation: shallow git clone --depth 1 --single-branch -b <prod_branch> of the profile's repo on git.projecteidos.com via an n8n Execute Command node, assemble a context-lite pack (README + git grep of error tokens with surrounding lines + recent prod-branch diffs + depth-limited file tree), one OpenAI call -> a tagged [AI-RCA] Jira comment (bug-confirmed?, suspected cause with file:line, probable fix, evidence, confidence, "advisory" disclaimer). Mandatory secret controls: read-only deploy key, deny-list .env/*.pem/secrets, regex redact pass, ephemeral checkout deleted on success and error. Skip cleanly if no codebase_url (covers WordPress/CRM apps with no repo). Grep-context first; RAG deferred or never. MVP effort: M.
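Two of the secret controls, the path deny-list and the regex redact pass, are sketched below. The deny-list mirrors the three entries named in the text; the redaction patterns are illustrative assumptions and would need tuning against the real repos before anything is sent to OpenAI.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

DENY_NAMES = (".env", "*.pem")   # file-name patterns from the text
DENY_PARTS = {"secrets"}         # any path component named "secrets"

PRIVATE_KEY = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.S,
)
# Any identifier containing a sensitive word, followed by ':' or '=' and a value.
ASSIGNMENT = re.compile(
    r"([\w-]*(?:password|passwd|secret|token|api[_-]?key)[\w-]*)(\s*[:=]\s*)(\S+)",
    re.I,
)


def is_denied(path: str) -> bool:
    """True if the file must never enter the context pack."""
    p = PurePosixPath(path)
    if DENY_PARTS & set(p.parts):
        return True
    return any(fnmatchcase(p.name, pattern) for pattern in DENY_NAMES)


def redact(text: str) -> str:
    """Mask private-key blocks and secret-looking assignments."""
    text = PRIVATE_KEY.sub("[REDACTED KEY]", text)
    return ASSIGNMENT.sub(lambda m: m.group(1) + m.group(2) + "[REDACTED]", text)
```

The pattern deliberately over-matches (an identifier such as `tokenizer` would also be masked), on the view that a slightly less useful RCA is cheaper than a leaked credential.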

6. Two-way customer comms (Jira <-> email) + full comms log

  • Outbound: engineer writes {{customer}} ... {{/customer}} in a Jira comment; a Jira webhook fires n8n, which strips the tags and emails just that span to the reporter. Untagged comments are internal and never sent (fail-closed).
  • Threading: subject suffix [EG-PARALLAX-1234] + X-Eidos-Ticket header + RFC 5322 In-Reply-To/References against the stored canonical Message-ID, so the whole conversation stays one mail thread.
  • Inbound: n8n IMAP on one canonical mailbox (support@eidos-global.com); match reply to ticket (header -> subject token -> References); append as an internal Jira comment (can't re-trigger outbound); stream attachments to OCI. Unmatched mail -> human triage folder + email alert, never dropped.
  • Status-change emails: key transitions (e.g. Closed) email the customer with RCA — from an {{rca}} tag or a dedicated Jira RCA field.
  • Full communications log (required): every outbound email, inbound reply, and notification is written to the append-only comms_audit table in Postgres at send/receive time — direction, ticket, from/to, subject, timestamp, channel, body (or body hash per the GDPR policy), Message-ID, attachment PAR URLs. This is the tamper-evident "what we told the customer and when" record, independent of editable Jira comments; internal alerts are logged the same way.
  • Loop/dedupe guards: drop auto-replies/OOO/bounces; ignore self-sent mail; idempotency on Jira comment ID + email Message-ID; threading/idempotency state in comms_threads.

MVP effort: M (+ S for the audit log, since the hooks already exist).
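The fail-closed outbound rule and the inbound matching order (header, then subject token, then References) can be sketched as below. The token regex is inferred from the single example [EG-PARALLAX-1234]. The `known_message_ids` dict stands in for a comms_threads lookup, and the header and the subject are assumed to carry the same token; all three are assumptions.

```python
import re
from typing import Dict, Optional

CUSTOMER_SPAN = re.compile(r"\{\{customer\}\}(.*?)\{\{/customer\}\}", re.S)
TICKET_TOKEN = re.compile(r"\[(EG-[A-Z0-9]+-\d+)\]")


def customer_text(comment: str) -> Optional[str]:
    """Return only the tagged spans; None means nothing is sent (fail-closed)."""
    spans = [span.strip() for span in CUSTOMER_SPAN.findall(comment)]
    spans = [span for span in spans if span]
    return "\n\n".join(spans) if spans else None


def match_ticket(headers: Dict[str, str], subject: str,
                 known_message_ids: Dict[str, str]) -> Optional[str]:
    """Match a reply to a ticket: header -> subject token -> References."""
    header = headers.get("X-Eidos-Ticket", "").strip()
    if header:
        return header
    token = TICKET_TOKEN.search(subject)
    if token:
        return token.group(1)
    # References lists ancestor Message-IDs oldest first; try the newest first.
    for message_id in reversed(headers.get("References", "").split()):
        if message_id in known_message_ids:
            return known_message_ids[message_id]
    return None  # caller routes unmatched mail to the human triage folder
```

Because `customer_text` returns None for an untagged comment, an engineer who forgets the closing tag sends nothing rather than leaking an internal note.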

7. Attachments via OCI Object Storage + PAR (stream-through)

Single private bucket in EIDOSDev1 (uk-london-1), prefix bugs/<app>/<jira-key>/<uuid>-<file>. Proxy-via-n8n: the browser/widget POSTs the blob to n8n, which streams it straight to OCI and never persists or buffers it to disk (pipe the request body to the PUT; release immediately; no comms/Jira/n8n storage of the bytes). The write PAR stays server-side, never in browser JS. Reads: short-lived per-object read PAR minted from n8n (fallback: pre-created long-lived PAR); Jira/email carry only the URL + metadata. Type allow-list, ~10–25 MB cap. OCI lifecycle auto-deletes objects after 2 years. MVP effort: S–M.
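The object naming and the pre-upload checks might be expressed as follows. The key layout is the prefix given above; the concrete allow-list entries and the choice of 25 MB (the top of the stated range) are assumptions.

```python
import re
import uuid
from typing import Optional

ALLOWED_TYPES = {"image/png", "image/jpeg", "application/pdf", "text/plain"}  # assumed
MAX_BYTES = 25 * 1024 * 1024  # upper end of the ~10-25 MB range


def object_key(app: str, jira_key: str, filename: str,
               uid: Optional[str] = None) -> str:
    """Build bugs/<app>/<jira-key>/<uuid>-<file> with a sanitised file name."""
    basename = re.split(r"[\\/]", filename)[-1]          # drop any client path
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", basename) or "file"
    uid = uid or str(uuid.uuid4())
    return f"bugs/{app}/{jira_key}/{uid}-{safe}"


def is_allowed(content_type: str, size_bytes: int) -> bool:
    """Type allow-list plus size cap, checked before any bytes are streamed."""
    return content_type in ALLOWED_TYPES and 0 < size_bytes <= MAX_BYTES
```

Stripping the client-supplied path before building the key prevents a crafted file name from escaping the per-ticket prefix.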

8. Reporting and metrics

Clone the proven careers-digest n8n pattern, but read the sla_ledger table in Postgres (not Data Tables): scheduled workflow aggregates in one Code node, renders a branded HTML email, sends per-org (external-safe, org-filtered in the query) plus an internal all-up league table. Never query Jira at report time — read the ledger. Metric set: total + type split + by-priority; MTTA and MTTR per priority (mean and median, business hours); SLA attainment % (response and resolve, separately); open/backlog + ageing buckets; reopen rate; first-contact resolution; auto-closed count; net flow; period-over-period deltas. Manual on-demand run via a form trigger calling the same maths. MVP effort: M.
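A small part of the aggregation (MTTA, MTTR and SLA attainment per priority) is sketched below over rows as they might come from sla_ledger. The column names (`ack_mins`, `resolve_mins`, `response_target_mins`, `resolution_target_mins`) are hypothetical, and the durations are assumed to be stored already in business minutes.

```python
from statistics import mean, median
from typing import Any, Dict, List, Optional


def _pct(hits: int, total: int) -> Optional[float]:
    return round(100.0 * hits / total, 1) if total else None


def summarise(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Per-priority MTTA/MTTR (mean and median) and SLA attainment %."""
    report: Dict[str, Dict[str, Any]] = {}
    for priority in sorted({row["priority"] for row in rows}):
        group = [row for row in rows if row["priority"] == priority]
        acked = [row for row in group if row["ack_mins"] is not None]
        resolved = [row for row in group if row["resolve_mins"] is not None]
        ack = [row["ack_mins"] for row in acked]
        res = [row["resolve_mins"] for row in resolved]
        report[priority] = {
            "count": len(group),
            "mtta_mean": mean(ack) if ack else None,
            "mtta_median": median(ack) if ack else None,
            "mttr_mean": mean(res) if res else None,
            "mttr_median": median(res) if res else None,
            # Response and resolve attainment are reported separately.
            "response_attainment_pct": _pct(
                sum(r["ack_mins"] <= r["response_target_mins"] for r in acked),
                len(acked)),
            "resolve_attainment_pct": _pct(
                sum(r["resolve_mins"] <= r["resolution_target_mins"] for r in resolved),
                len(resolved)),
        }
    return report
```

Here attainment is computed over tickets that have actually been acknowledged or resolved; whether still-open breached tickets should count against the percentage is a policy choice this sketch leaves open.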

9. AI chatbot — n8n AI Agent + custom chat form (OTP + CAPTCHA)

Reworked per decision: no Authentik/Open WebUI. A custom chat form (hostable as a static page or n8n form) backed by an n8n AI Agent workflow. Authentication is the same 5-minute email OTP + Altcha; on success n8n issues a short-lived session and scopes Jira reads server-side to that verified email:

  • A normal user sees only tickets they reported (match on verified email).
  • An org-admin (flagged in contact.role='org_admin') sees all tickets for their org only; n8n maps the verified email -> org via bugsys, and never trusts a client-supplied scope.

OTP state lives in the Postgres otp table. This doubles as the identity mechanism for the app-down form (capability 1). Deferred to a later phase; ship v2 without it.
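The OTP gate shared by the chatbot and the app-down form could behave like the sketch below. An in-memory dict stands in for the Postgres otp table; the 6-digit code, the attempt limit of 5, hashing the code at rest and single-use semantics are assumptions layered on the stated 5-minute expiry and attempt counters.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta
from typing import Any, Dict

OTP_TTL = timedelta(minutes=5)  # from the text
MAX_ATTEMPTS = 5                # assumption


def _digest(code: str) -> str:
    return hashlib.sha256(code.encode()).hexdigest()


def issue_otp(store: Dict[str, Dict[str, Any]], email: str, now: datetime) -> str:
    """Create (or replace) the code for this email and return it for sending."""
    code = f"{secrets.randbelow(10**6):06d}"
    store[email] = {"hash": _digest(code), "expires_at": now + OTP_TTL, "attempts": 0}
    return code


def verify_otp(store: Dict[str, Dict[str, Any]], email: str, code: str,
               now: datetime) -> bool:
    """Check a submitted code; expired or exhausted rows are discarded."""
    row = store.get(email)
    if row is None:
        return False
    if now > row["expires_at"] or row["attempts"] >= MAX_ATTEMPTS:
        del store[email]
        return False
    row["attempts"] += 1
    if hmac.compare_digest(row["hash"], _digest(code)):
        del store[email]  # single use
        return True
    return False
```

Only after `verify_otp` succeeds would n8n mint the short-lived session and derive the ticket scope from the verified email.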

Cross-cutting hardening

| # | Gap | Fix (locked) | Phase |
|---|-----|--------------|-------|
| 1 | Durable inbox: if n8n is down (when the app-down form is most used) the bug is lost | Write raw payload to intake_raw in Postgres first, then process; recoverable/replayable manually | 0 |
| 2 | Nothing watches the bug system itself | Synthetic canary posts a test intake every N min, emails if it doesn't land (uptime-kuma, independent host) | 0 |
| 3 | Dead-letter / replay | Global n8n Error Trigger -> dead_letter table + alert + manual replay workflow | 0 |
| 4 | End-to-end idempotency | Carry intakeId throughout; store intakeId -> jira_key, look up before creating | 0/1 |
| 5 | Rate-limit / spam / abuse | Per-appKey + per-IP limit in the auth node (defaults: per-IP 5/min, 30/hour; per-appKey 500/day, all configurable per service); OTP + Altcha on the public form; global daily ceiling ~2,000/day -> alert; screen only after rate-limit | 1 |
| 6 | GDPR / PII / retention (in scope) | Optional client-side PII redaction in the widget; attachments auto-delete at 2 years; matching retention windows for ledger/comms/closed tickets; documented erasure-by-email path | 1 (bucket) / 4 (policy) |
| 7 | Tenant isolation untested | Cross-org leak test pack (two fake orgs) after any profile change; org mapping treated as security-critical config | 1+ |
| 8 | Secrets home | Runtime secrets (appKey/HMAC, OCI write PAR, Jira SA token, OpenAI key) in Vault (vault.448.global), read by n8n at runtime. Caveat: Vault sits on the O1 SPOF and is unbacked; its own resilience (backup, eventual move off O1) is a tracked estate risk this system inherits | 0 |
| 9 | Config DB backup never restore-tested | Real restore drill as Phase-0 acceptance | 0 |
| 10 | Full comms log + status page | Append-only comms_audit (see §6); static status page (uptime-kuma public), independent host | 2 / 4 |
| 11 | v1 -> v2 migration across 13 apps | Strangler: v2 alongside v1; Parallax first, like-for-like (fallback retained); fan out; leave v1 BUG_REPORTS rows as read-only history | 1 then 4 |
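The per-IP defaults in row 5 (5/min and 30/hour) could be enforced with a sliding-window check like the one below. The in-memory deque is only for illustration; in the auth node the counters would need to live in a shared store, and that choice is not made in this proposal.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class SlidingWindowLimiter:
    """Allow a request only if every (window_seconds, max_hits) limit holds."""

    def __init__(self, limits: Tuple[Tuple[int, int], ...] = ((60, 5), (3600, 30))):
        self.limits = limits
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        queue = self.hits[key]
        horizon = max(window for window, _ in self.limits)
        # Forget hits older than the longest window.
        while queue and queue[0] <= now - horizon:
            queue.popleft()
        for window, max_hits in self.limits:
            recent = sum(1 for t in queue if t > now - window)
            if recent >= max_hits:
                return False  # rejected requests are not recorded
        queue.append(now)
        return True
```

A per-appKey daily cap would be a second limiter instance keyed on appKey with `limits=((86400, 500),)`.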

Phased delivery plan

| Phase | Goal | Includes | Entry |
|-------|------|----------|-------|
| 0: Foundations | De-risk first | bugsys Postgres on E2 + OCI backup with restore drill; secrets in Vault; OpenAI credit monitor (KI-046); canary + email alerting; global error/dead-letter workflow; write-first intake_raw | |
| 1: Intake (Parallax, like-for-like) | One reliable pipe + Parallax fallback | /intake + normalise + per-appKey auth + rate-limit; generalise v1 widget; Bug Profiles core + cross-org leak test; idempotent Jira create; OCI attachments (stream-through, 2yr); standalone app-down form (OTP + CAPTCHA) | Phase 0 |
| 2: Lifecycle + comms | Two-way value | SLA snapshot + breach poll + L1 escalation + auto-close; lifecycle workflow; {{customer}} outbound + inbound->Jira + dedupe/loop guards + comms_audit log; notification gate | Phase 1 |
| 3: AI assist | Deflection + RCA | AI screening (fail-open, simple-math priority); AI RCA (context-lite, deny-list+redact, advisory); per-profile rca toggle | KI-046 monitor live |
| 4: Reporting + status + rollout | Visibility + fan-out | sla_ledger + monthly per-org + internal all-up; status page; GDPR retention across stores; strangler rollout to remaining ~12 apps; onboarding runbook | Phase 2 |
| 5: Optional chatbot | Nice-to-have | n8n AI Agent chat form + email OTP + CAPTCHA, scoped to verified email / org-admin | demand confirmed |

Dependencies (not hard gates)

  • Atlassian service-account token — still valid; not a blocker. Worst case we move to a paid Atlassian plan. Build on the SA token, keep paid-plan migration as the fallback (related: KI-044 / RM-046).
  • OpenAI credit monitor (KI-046) — before AI ships (Phase 3).
  • Gotify broken (KI-022) — email-only alerting.
  • M365 mailbox + SPF/DKIM/DMARC for support@eidos-global.com — needs Stacy/Adam.
  • GitLab read-only deploy keys per repo for RCA — needs Sergiu.
  • Vault reachability + resilience — n8n on E2 must reach vault.448.global (on O1); Vault's lack of backup and O1-SPOF placement is an inherited risk (see backup posture).

Decisions locked (2026-06-02)

  1. Atlassian SA token still valid — not a hard gate; paid-plan migration is the worst-case fallback.
  2. Unify on Postgres (bugsys); drop n8n Data Tables for this system.
  3. Jira Software only, no JSM/new licensing; priority via simple AI + user-input math.
  4. Proxy-via-n8n attachments, stream-through so n8n never holds blob bytes.
  5. Business hours a per-service config; default UK only (UK+India optional).
  6. GDPR in scope; attachments auto-delete at 2 years.
  7. Durable buffer = write-first in Postgres; Postgres + Jira are the manual-fallback sources of truth.
  8. External identity / chatbot auth = 5-minute email OTP + CAPTCHA, state in Postgres (no Authentik).
  9. Parallax first, like-for-like replacement (keeps a fallback), then fan out.
  10. Chatbot = n8n AI Agent + custom chat form (OTP + Altcha), not Open WebUI/Authentik.
  11. Secrets in Vault (vault.448.global), read by n8n at runtime.
  12. CAPTCHA = Altcha (open-source, self-hosted, proof-of-work, least maintenance).
  13. Rate-limit defaults (industry-standard, per-service configurable): per-IP 5/min + 30/hour, per-appKey 500/day, global ~2,000/day trip-wire alert.

References