Case Study

Discovery prioritized cited answers over static reporting

Overview

Discovery showed analysts needed fast, trustworthy citations—not another portal. We turned 100–200 page oncology reports into conversational, sourced answers with direct links to the original pages.

We built a RAG pipeline that ingests PDF and Excel files, indexes the content in Postgres + pgvector, and returns grounded answers via o4‑mini, complete with citations so teams can verify claims instantly.

Architecture & Design
  • Ingestion: PDF parsing with table extraction and Excel normalization into typed records.
  • Chunking: Semantic splitting with overlap, sized for oncology abstracts, tables, and figure captions (a minimal sketch follows this list).
  • Embeddings: Sentence transformers generate dense vectors; metadata captured for source, page, section, and entity types.
  • Index: Postgres + pgvector for similarity search, with IVFFlat indexes tuned for latency and recall.
  • Retrieval: Hybrid scoring (BM25 + vector) plus MMR to balance relevance and diversity, with an optional cross‑encoder re‑rank (retrieval and MMR are sketched after this list).
  • Generation: o4‑mini with structured prompt templates, citation injection, and JSON‑typed outputs (see the generation sketch below).
  • Grounding UX: Click‑through citations to the exact PDF page/section; expandable context snippets.
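
How the chunker works, as a minimal sketch: a plain sliding window stands in for the full semantic splitter, and the window sizes are illustrative defaults rather than the per-content-type values used in production.

```python
from typing import Iterator

def chunk_words(text: str, max_words: int = 400, overlap: int = 60) -> Iterator[str]:
    """Sliding-window chunking with overlap; a stand-in for the semantic
    splitter. `max_words` and `overlap` are illustrative defaults."""
    words = text.split()
    step = max_words - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        if window:
            yield " ".join(window)
        if start + max_words >= len(words):
            break
```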
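The retrieval step, sketched under assumptions: the `chunks` table and its columns are hypothetical names, the pool and `k` sizes are illustrative, and the BM25 half of the hybrid score is omitted for brevity. The MMR function trades query relevance against redundancy among the picks.

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray,
        k: int = 8, lam: float = 0.7) -> list[int]:
    """Maximal Marginal Relevance over L2-normalized vectors: balance
    relevance to the query against redundancy among selected chunks."""
    sims = doc_vecs @ query_vec
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        if selected:
            redundancy = np.max(doc_vecs[candidates] @ doc_vecs[selected].T, axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lam * sims[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected

def retrieve(conn: psycopg.Connection, query_vec: np.ndarray,
             k: int = 8, pool: int = 40):
    """Pull a candidate pool by cosine distance, then diversify with MMR.
    The `chunks` table and column names are hypothetical for this sketch."""
    register_vector(conn)
    rows = conn.execute(
        "SELECT id, content, source, page, embedding "
        "FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (query_vec, pool),
    ).fetchall()
    doc_vecs = np.stack([np.asarray(r[4], dtype=np.float32) for r in rows])
    keep = mmr(query_vec, doc_vecs, k=k)
    return [rows[i][:4] for i in keep]  # (id, content, source, page)
```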
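The grounded generation step, sketched with the OpenAI SDK's chat completions API. The prompt wording, JSON shape, and `answer` helper are illustrative assumptions; it consumes the (id, content, source, page) rows from the retrieval sketch.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, chunks: list[tuple]) -> str:
    """Ask o4-mini for a JSON answer grounded in numbered context blocks.

    `chunks` holds (id, content, source, page) rows from retrieval; the
    prompt and JSON shape are illustrative, not the production templates.
    """
    context = "\n\n".join(
        f"[{i + 1}] ({source}, p.{page}) {content}"
        for i, (_id, content, source, page) in enumerate(chunks)
    )
    resp = client.chat.completions.create(
        model="o4-mini",
        response_format={"type": "json_object"},  # JSON mode; options vary by SDK version
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer only from the numbered context. Return JSON with keys "
                    "'answer' and 'citations' (a list of context numbers). "
                    "If the context is insufficient, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return resp.choices[0].message.content
```

Numbering the injected sources keeps the citation format machine-checkable, which the evals described later rely on.
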
Data Pipeline Details
  • Document Processing: PDF text, tables, and references tokenized; domain heuristics preserve drug names, trial phases, and targets.
  • Entity Enrichment: Lightweight NER normalizes drug names, indications, targets, and sponsors to stable identifiers for joinability (a toy normalizer follows this list).
  • Deduplication: Near‑duplicate detection on abstracts and press releases prevents echoing the same fact across sources (see the dedup sketch below).
  • Temporal Context: Report date and conference session metadata enable longitudinal queries across editions.
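
A toy version of the alias normalization; the mapping and helper name are illustrative, and the real pipeline resolves mentions against a curated vocabulary rather than a hard-coded dict.

```python
# Illustrative alias table; in production, mentions resolve against a
# curated vocabulary of drugs, indications, targets, and sponsors.
ALIASES = {
    "keytruda": "pembrolizumab",
    "pembro": "pembrolizumab",
    "mk-3475": "pembrolizumab",
}

def normalize_entity(mention: str) -> str:
    """Map a surface form to a stable identifier so records join cleanly
    across reports; unknown mentions fall back to their lowercased form."""
    key = mention.strip().lower()
    return ALIASES.get(key, key)
```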
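A minimal near-duplicate filter using character-level similarity from the standard library; the threshold is an assumption, and shingling or embedding similarity would scale better than this O(n²) sketch.

```python
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Character-level similarity as a cheap near-duplicate test; the
    0.9 threshold is an illustrative assumption."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def dedupe(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each near-duplicate cluster (O(n^2))."""
    kept: list[str] = []
    for t in texts:
        if not any(is_near_duplicate(t, k) for k in kept):
            kept.append(t)
    return kept
```
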
Privacy, Security & Compliance
  • Least‑Privilege & RBAC: Role‑scoped access to reports and embeddings; secrets loaded via environment variables (see the connection sketch after this list).
  • Encryption: TLS in transit; encrypted Postgres at rest with separate KMS‑managed keys.
  • Data Minimization: We store embeddings and essential metadata only; no raw PHI/PII persisted beyond processing needs.
  • Auditability: Request/response traces and retrieval logs support SOC 2 controls for change management and access reviews.
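
What that configuration looks like in practice, as a sketch: credentials are read from the environment and TLS is required on every connection. `DATABASE_URL` is a hypothetical variable name.

```python
import os
import psycopg

# Secrets come from the environment (or a secrets manager), never from code.
# DATABASE_URL is a hypothetical variable name for this sketch.
conn = psycopg.connect(
    os.environ["DATABASE_URL"],
    sslmode="require",  # refuse plaintext connections: TLS in transit
)
```
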
Observability & Quality
  • Tracing: Token‑level traces capture retrieval sets, prompt variants, and model outputs for debugging and regression checks.
  • Evals: Task‑oriented evals (answer faithfulness, citation correctness, entity accuracy) guide prompt and retrieval tuning; a citation‑correctness check is sketched after this list.
  • Feedback Loops: Analysts can flag answers; feedback feeds re‑ranking data and prompt adjustments.
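
One of the evals, sketched: citation correctness as the fraction of cited context numbers that actually point at a retrieved chunk, assuming the JSON answer shape from the generation sketch above.

```python
def citation_correctness(answer_json: dict, num_retrieved: int) -> float:
    """Fraction of cited context numbers that point at a retrieved chunk;
    1.0 means every citation is grounded. Assumes the JSON shape from the
    generation sketch (citations are 1-based context numbers)."""
    cited = answer_json.get("citations", [])
    if not cited:
        return 0.0
    valid = sum(1 for c in cited if isinstance(c, int) and 1 <= c <= num_retrieved)
    return valid / len(cited)
```
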
Results
  • Faster Insight: Clients locate drug‑specific intelligence in seconds instead of scanning hundreds of pages.
  • Trust via Citations: One‑click access to source PDFs increased confidence and adoption across stakeholders.
  • Longitudinal Q&A: Users ask cross‑report questions (e.g., changes across conferences) with consistent answers.
Stack Highlights
  • OpenAI SDK with o4‑mini for grounded, structured generation
  • Postgres + pgvector for vector search and metadata filtering
  • Sentence‑transformer embeddings and optional cross‑encoder re‑rank
  • Typed ETL for PDFs and Excel; normalized oncology entities
  • Tracing, evals, and feedback for continuous quality