Case Study

Discovery prioritized cited answers over static reporting

Overview

Discovery showed analysts needed fast, trustworthy citations—not another portal. We turned 100–200 page oncology reports into conversational, sourced answers with direct links to the original pages.

We built a RAG pipeline that ingests PDF and Excel files, indexes the content in Postgres + pgvector, and returns grounded answers via o4‑mini, complete with citations so teams can verify claims instantly.

Architecture & Design
  • Ingestion: PDF parsing with table extraction and Excel normalization into typed records.
  • Chunking: Semantic splitting with overlap, sized for oncology abstracts, tables, and figure captions (a minimal sketch follows this list).
  • Embeddings: Sentence transformers generate dense vectors; metadata captured for source, page, section, and entity types.
  • Index: Postgres + pgvector for similarity search, with IVFFlat indexes tuned for latency and recall.
  • Retrieval: Hybrid scoring (BM25 + vector) plus MMR to balance relevance and diversity, with an optional cross‑encoder re‑rank (retrieval and MMR are sketched after this list).
  • Generation: o4‑mini with structured prompt templates, citation injection, and JSON‑typed outputs (see the generation sketch below).
  • Grounding UX: Click‑through citations to the exact PDF page/section; expandable context snippets.
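
How the chunker works, as a minimal sketch: a plain sliding window stands in for the full semantic splitter, and the window sizes are illustrative defaults rather than the per-content-type values used in production.

```python
from typing import Iterator

def chunk_words(text: str, max_words: int = 400, overlap: int = 60) -> Iterator[str]:
    """Sliding-window chunking with overlap; a stand-in for the semantic
    splitter. `max_words` and `overlap` are illustrative defaults."""
    words = text.split()
    step = max_words - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        if window:
            yield " ".join(window)
        if start + max_words >= len(words):
            break
```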
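The retrieval step, sketched under assumptions: the `chunks` table and its columns are hypothetical names, the pool and `k` sizes are illustrative, and the BM25 half of the hybrid score is omitted for brevity. The MMR function trades query relevance against redundancy among the picks.

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray,
        k: int = 8, lam: float = 0.7) -> list[int]:
    """Maximal Marginal Relevance over L2-normalized vectors: balance
    relevance to the query against redundancy among selected chunks."""
    sims = doc_vecs @ query_vec
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        if selected:
            redundancy = np.max(doc_vecs[candidates] @ doc_vecs[selected].T, axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lam * sims[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected

def retrieve(conn: psycopg.Connection, query_vec: np.ndarray,
             k: int = 8, pool: int = 40):
    """Pull a candidate pool by cosine distance, then diversify with MMR.
    The `chunks` table and column names are hypothetical for this sketch."""
    register_vector(conn)
    rows = conn.execute(
        "SELECT id, content, source, page, embedding "
        "FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (query_vec, pool),
    ).fetchall()
    doc_vecs = np.stack([np.asarray(r[4], dtype=np.float32) for r in rows])
    keep = mmr(query_vec, doc_vecs, k=k)
    return [rows[i][:4] for i in keep]  # (id, content, source, page)
```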
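The grounded generation step, sketched with the OpenAI SDK's chat completions API. The prompt wording, JSON shape, and `answer` helper are illustrative assumptions; it consumes the (id, content, source, page) rows from the retrieval sketch.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, chunks: list[tuple]) -> str:
    """Ask o4-mini for a JSON answer grounded in numbered context blocks.

    `chunks` holds (id, content, source, page) rows from retrieval; the
    prompt and JSON shape are illustrative, not the production templates.
    """
    context = "\n\n".join(
        f"[{i + 1}] ({source}, p.{page}) {content}"
        for i, (_id, content, source, page) in enumerate(chunks)
    )
    resp = client.chat.completions.create(
        model="o4-mini",
        response_format={"type": "json_object"},  # JSON mode; options vary by SDK version
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer only from the numbered context. Return JSON with keys "
                    "'answer' and 'citations' (a list of context numbers). "
                    "If the context is insufficient, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return resp.choices[0].message.content
```

Numbering the injected sources keeps the citation format machine-checkable, which the evals described later rely on.
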
Data Pipeline Details
  • Document Processing: PDF text, tables, and references tokenized; domain heuristics preserve drug names, trial phases, and targets.
  • Entity Enrichment: Lightweight NER normalizes drug names, indications, targets, and sponsors to stable identifiers for joinability (a toy normalizer follows this list).
  • Deduplication: Near‑duplicate detection on abstracts and press releases prevents echoing the same fact across sources (see the dedup sketch below).
  • Temporal Context: Report date and conference session metadata enable longitudinal queries across editions.
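
A toy version of the alias normalization; the mapping and helper name are illustrative, and the real pipeline resolves mentions against a curated vocabulary rather than a hard-coded dict.

```python
# Illustrative alias table; in production, mentions resolve against a
# curated vocabulary of drugs, indications, targets, and sponsors.
ALIASES = {
    "keytruda": "pembrolizumab",
    "pembro": "pembrolizumab",
    "mk-3475": "pembrolizumab",
}

def normalize_entity(mention: str) -> str:
    """Map a surface form to a stable identifier so records join cleanly
    across reports; unknown mentions fall back to their lowercased form."""
    key = mention.strip().lower()
    return ALIASES.get(key, key)
```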
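A minimal near-duplicate filter using character-level similarity from the standard library; the threshold is an assumption, and shingling or embedding similarity would scale better than this O(n²) sketch.

```python
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Character-level similarity as a cheap near-duplicate test; the
    0.9 threshold is an illustrative assumption."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def dedupe(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each near-duplicate cluster (O(n^2))."""
    kept: list[str] = []
    for t in texts:
        if not any(is_near_duplicate(t, k) for k in kept):
            kept.append(t)
    return kept
```
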
Privacy, Security & Compliance
  • Least‑Privilege & RBAC: Role‑scoped access to reports and embeddings; secrets loaded via environment variables (see the connection sketch after this list).
  • Encryption: TLS in transit; encrypted Postgres at rest with separate KMS‑managed keys.
  • Data Minimization: We store embeddings and essential metadata only; no raw PHI/PII persisted beyond processing needs.
  • Auditability: Request/response traces and retrieval logs support SOC 2 controls for change management and access reviews.
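
What that configuration looks like in practice, as a sketch: credentials are read from the environment and TLS is required on every connection. `DATABASE_URL` is a hypothetical variable name.

```python
import os
import psycopg

# Secrets come from the environment (or a secrets manager), never from code.
# DATABASE_URL is a hypothetical variable name for this sketch.
conn = psycopg.connect(
    os.environ["DATABASE_URL"],
    sslmode="require",  # refuse plaintext connections: TLS in transit
)
```
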
Observability & Quality
  • Tracing: Token‑level traces capture retrieval sets, prompt variants, and model outputs for debugging and regression checks.
  • Evals: Task‑oriented evals (answer faithfulness, citation correctness, entity accuracy) guide prompt and retrieval tuning; a citation‑correctness check is sketched after this list.
  • Feedback Loops: Analysts can flag answers; feedback feeds re‑ranking data and prompt adjustments.
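
One of the evals, sketched: citation correctness as the fraction of cited context numbers that actually point at a retrieved chunk, assuming the JSON answer shape from the generation sketch above.

```python
def citation_correctness(answer_json: dict, num_retrieved: int) -> float:
    """Fraction of cited context numbers that point at a retrieved chunk;
    1.0 means every citation is grounded. Assumes the JSON shape from the
    generation sketch (citations are 1-based context numbers)."""
    cited = answer_json.get("citations", [])
    if not cited:
        return 0.0
    valid = sum(1 for c in cited if isinstance(c, int) and 1 <= c <= num_retrieved)
    return valid / len(cited)
```
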
Results
  • Faster Insight: Clients locate drug‑specific intelligence in seconds instead of scanning hundreds of pages.
  • Trust via Citations: One‑click access to source PDFs increased confidence and adoption across stakeholders.
  • Longitudinal Q&A: Users ask cross‑report questions (e.g., changes across conferences) with consistent answers.
Stack Highlights
  • OpenAI SDK with o4‑mini for grounded, structured generation
  • Postgres + pgvector for vector search and metadata filtering
  • Sentence‑transformer embeddings and optional cross‑encoder re‑rank
  • Typed ETL for PDFs and Excel; normalized oncology entities
  • Tracing, evals, and feedback for continuous quality