
1. Project Objective
Build a Conversational Assistant (CA) specialized in biological discovery that lets researchers and technical staff query biomedical data and literature in natural language, obtain reliable summaries, cited references, and visualizations (pathways, networks), and generate exploratory hypotheses. The CA will use retrieval-augmented generation (RAG) to minimize “hallucinations” and always provide sources.
2. Research Questions (Hypotheses)
- An LLM + RAG that queries public repositories (PubMed, NCBI, UniProt, Reactome, GEO) can answer molecular biology questions with ≥ 85% factual accuracy as measured by expert evaluators.
- Integrating ontologies (Gene Ontology, MeSH, UniProt IDs) improves retrieval relevance by ≥ 15% compared to simple text-based searches.
- Showing evidence (snippets and links) alongside answers reduces user distrust compared to responses without sources.
3. Scope (MVP vs later phases)
MVP (3 months):
- Answers to basic queries: genes, diseases, metabolic pathways, relevant publications, related public datasets.
- Primary sources: PubMed + NCBI Gene + UniProt + Reactome.
- Functionality: web chat, RAG with vector DB, cited results, confidence ranking, simple report export (PDF/CSV).
Phase 2 (3–9 months):
- Integration of GEO (expression), ClinVar (variants), interactive visualizations (networks and pathways), authentication, conversation history, human-in-the-loop feedback.
Phase 3 (9–18 months):
- Support for reproducible workflows (e.g., requests that launch RNA-seq analysis pipelines), integration with lab/partners, training modules with private data (multi-tenant).

4. Data Sources and Prioritization (MVP first)
High priority (MVP):
- PubMed (abstracts, metadata)
- NCBI Gene / Entrez / RefSeq
- UniProt (proteins, functions)
- Reactome / KEGG (pathways)
- DOIs and article metadata (CrossRef)
Medium priority (Phase 2):
- GEO (gene expression), dbSNP, ClinVar, PDB/AlphaFold (structures), PubChem (compounds)
Annotations / ontologies:
- Gene Ontology (GO), MeSH, UMLS / SNOMED (if clinical domain needed)
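To make the prioritization concrete, here is a minimal sketch (an assumption, not the final ingestion code) of pulling the highest-priority source, PubMed, through NCBI's E-utilities via Biopython's Entrez wrapper; the query string, contact e-mail, and result limit are illustrative placeholders.

```python
# Minimal sketch: fetch recent PubMed abstracts via NCBI E-utilities (Biopython).
# pip install biopython
from Bio import Entrez

Entrez.email = "team@example.org"   # NCBI requires a contact e-mail for E-utilities calls

def fetch_pubmed_abstracts(query: str, max_results: int = 20) -> list[dict]:
    """Search PubMed and return PMID, title, and abstract for the top matches."""
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    pmids = Entrez.read(handle)["IdList"]
    handle.close()
    if not pmids:
        return []
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids), retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    results = []
    for article in records["PubmedArticle"]:
        meta = article["MedlineCitation"]["Article"]
        abstract = " ".join(str(part) for part in meta.get("Abstract", {}).get("AbstractText", []))
        results.append({
            "pmid": str(article["MedlineCitation"]["PMID"]),
            "title": str(meta["ArticleTitle"]),
            "abstract": abstract,
        })
    return results

# Illustrative query; real ingestion jobs would iterate over curated query sets per source.
print(fetch_pubmed_abstracts("dilated cardiomyopathy[Title/Abstract] AND genetics", max_results=5))
```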
5. Proposed Technical Architecture (high-level)
- Ingestion and normalization
- Crawlers/ETL pipelines to extract metadata and text (abstracts, titles, sequences when applicable).
- Normalization of identifiers (Ensembl/NCBI/UniProt), ontology mapping.
- Indexing and vectorization
- Create semantic embeddings (per chunk) for abstracts, gene descriptions, pathways, UniProt entries.
- Store embeddings in a vector DB (Qdrant / Milvus / Weaviate / Pinecone).
- Retrieval system (RAG)
- Similarity-based retrieval + filters by source/date/organism.
- Ranking pipeline: BM25/lexical + embedding score fusion (see the indexing/retrieval sketch at the end of this section).
- Generation / LLM
- Base model for generation (e.g., a fine-tuned Llama 2/3 or the OpenAI API) with prompt templates that include the retrieved context, cited snippets, and the expected response structure.
- Verification module: heuristics/models to detect contradictions; return “I don’t know” or only the retrieved extracts when confidence is low.
- Evidence and UI layer
- Display retrieved snippets with links (source-first design).
- Visualizations: metabolic pathways (React + D3 / Cytoscape.js), co-occurrence graphs.
- Telemetry & Feedback
- Log queries, satisfaction rate, expert correction markers for retraining.
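The indexing and retrieval steps above can be prototyped before any vector DB is deployed. Below is a minimal sketch of the hybrid-ranking idea, assuming the sentence-transformers and rank_bm25 packages and a placeholder embedding model; a production version would store vectors in Qdrant/Milvus and apply the source/date/organism filters described above.

```python
# Minimal sketch of hybrid retrieval: BM25 lexical score fused with embedding cosine.
# pip install sentence-transformers rank_bm25 numpy ; the model name is illustrative.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

documents = [
    {"id": "PMID:0001", "text": "TTN truncating variants are a major cause of dilated cardiomyopathy."},
    {"id": "PMID:0002", "text": "LMNA mutations associate with dilated cardiomyopathy and conduction disease."},
    {"id": "UniProt:P12883", "text": "MYH7 encodes beta-myosin heavy chain, expressed in cardiac muscle."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")            # placeholder embedding model
doc_vectors = model.encode([d["text"] for d in documents], normalize_embeddings=True)
bm25 = BM25Okapi([d["text"].lower().split() for d in documents])

def normalize(scores: np.ndarray) -> np.ndarray:
    """Min-max scale so lexical and semantic scores are comparable before fusion."""
    rng = scores.max() - scores.min()
    return (scores - scores.min()) / rng if rng > 0 else np.zeros_like(scores)

def hybrid_search(query: str, k: int = 3, alpha: float = 0.5):
    """Return the top-k documents ranked by a weighted BM25 + cosine score."""
    lexical = normalize(np.array(bm25.get_scores(query.lower().split())))
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    semantic = normalize(doc_vectors @ q_vec)              # cosine, since vectors are normalized
    fused = alpha * lexical + (1 - alpha) * semantic
    order = np.argsort(fused)[::-1][:k]
    return [(documents[i]["id"], float(fused[i])) for i in order]

print(hybrid_search("genes associated with dilated cardiomyopathy"))
```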

6. Key Components / Suggested Stacks
- Backend: Node.js / Python (FastAPI)
- Vector DB: Qdrant or Milvus (open source) or Pinecone (hosted)
- Embeddings: OpenAI embeddings or open-source models (e.g., sentence-transformers)
- LLM: OpenAI API (for prototyping) or a locally hosted Llama 2/3 / Mistral for cost control at scale
- Conventional DB: PostgreSQL (metadata, users)
- ETL orchestration: Airflow / Dagster, or simple cron + Python scripts for the MVP
- Frontend: React (dashboard + chat), Cytoscape.js for networks
- Infra: Cloud (GCP/Azure/AWS). For BEX in Peru, GCP is a natural fit (the team already uses Google Cloud).
- CI/CD: GitHub Actions / GitLab CI
- Observability: Sentry, Prometheus + Grafana (optional)
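To show how these pieces would be wired together on the backend, here is a minimal FastAPI sketch of a single /ask endpoint. The retrieve() and generate_answer() functions are hypothetical stand-ins (with canned outputs) for the retrieval and LLM layers described in Section 5, not implemented components.

```python
# Minimal FastAPI sketch of the chat backend (Python 3.10+).
# `retrieve` and `generate_answer` are hypothetical stand-ins with canned output.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Bio discovery assistant (MVP sketch)")

class AskRequest(BaseModel):
    question: str
    organism: str | None = None     # optional filter, e.g. "Homo sapiens"
    top_k: int = 5

class AskResponse(BaseModel):
    answer: str
    citations: list[dict]           # [{"source": ..., "year": ..., "url": ...}, ...]
    confidence: float

def retrieve(question: str, organism: str | None, k: int) -> list[dict]:
    """Hypothetical stand-in for the vector-DB retrieval layer (returns canned evidence)."""
    return [{"source": "PubMed", "year": 2023, "url": "https://pubmed.ncbi.nlm.nih.gov/",
             "text": "Example snippet."}][:k]

def generate_answer(question: str, snippets: list[dict]) -> tuple[str, float]:
    """Hypothetical stand-in for the LLM generation + verification step."""
    return "Summary grounded in the retrieved snippets.", 0.87

@app.post("/ask", response_model=AskResponse)
def ask(req: AskRequest) -> AskResponse:
    snippets = retrieve(req.question, organism=req.organism, k=req.top_k)
    answer, confidence = generate_answer(req.question, snippets)
    citations = [{"source": s["source"], "year": s["year"], "url": s["url"]} for s in snippets]
    return AskResponse(answer=answer, citations=citations, confidence=confidence)
```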
7. Conversation Flow Design (example)
User: “What genes are associated with dilated cardiomyopathy and which recent papers support them?”
- Preprocessing: detect entity (dilated cardiomyopathy) and species (human).
- Retrieval: query PubMed (last 5 years), ClinVar, NCBI Gene; retrieve top-k snippets + metadata.
- Rank results and extract evidence.
- LLM prompt: “Respond in 3 paragraphs: short summary, list 5 genes with evidence (paper + year), limitations + links.”
- Response: includes citations [1]–[5] with links and a confidence score (e.g., 0.87).
- UI shows “see more” to open abstracts or download report.
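The same flow, expressed as an orchestration sketch. Every helper here (detect_entities, search_sources, call_llm) is a hypothetical stand-in with canned output so the end-to-end control flow is visible; none of them correspond to an existing codebase.

```python
# Sketch of the conversation flow above as a single orchestration function.
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str
    year: int
    url: str
    snippet: str
    score: float

def detect_entities(question: str) -> dict:
    # Stand-in for a biomedical NER / dictionary-lookup step.
    return {"disease": "dilated cardiomyopathy", "species": "human"}

def search_sources(entities: dict, years: int = 5) -> list[Evidence]:
    # Stand-in for querying PubMed / ClinVar / NCBI Gene with date filters.
    return [Evidence("PubMed", 2023, "https://pubmed.ncbi.nlm.nih.gov/", "TTN truncating variants ...", 0.91),
            Evidence("ClinVar", 2022, "https://www.ncbi.nlm.nih.gov/clinvar/", "LMNA pathogenic variants ...", 0.84)]

def call_llm(prompt: str) -> tuple[str, float]:
    # Stand-in for the generation + verification step; returns (answer, confidence).
    return "Short grounded summary with numbered citations.", 0.87

def answer_query(question: str) -> dict:
    entities = detect_entities(question)                                                    # 1. preprocessing
    evidence = sorted(search_sources(entities), key=lambda e: e.score, reverse=True)[:5]    # 2-3. retrieve + rank
    prompt = f"Question: {question}\nEvidence: {[e.snippet for e in evidence]}"             # 4. RAG prompt (Section 15)
    answer, confidence = call_llm(prompt)
    citations = [{"ref": i + 1, "source": e.source, "year": e.year, "url": e.url}
                 for i, e in enumerate(evidence)]                                           # 5. citations for the UI panel
    return {"answer": answer, "citations": citations, "confidence": confidence}

print(answer_query("What genes are associated with dilated cardiomyopathy?"))
```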

8. Evaluation / Research Metrics
- Factuality: % of correct answers according to expert panel (gold standard).
- Recall/Precision@k: on a test set of queries with expected answers.
- Latency: <2 s retrieval + <4 s end-to-end response in the MVP.
- User satisfaction: NPS / 1–5 scale.
- Hallucination rate: % of claims not verifiable from the cited fragments (goal <10% for the MVP).
- Source coverage: % of queries with at least one relevant source returned.
Benchmark: ~200 real-world questions (consultants + researchers), evaluated in rounds: baseline (lexical search), RAG with embeddings, RAG + ontology.
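As a sketch of how two of these metrics could be computed from the annotated benchmark (the field names relevant_ids and supported are assumptions about the annotation format, not an agreed schema):

```python
# Sketch of Precision@k and hallucination rate over an expert-annotated evaluation set.
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that experts marked as relevant."""
    top = retrieved_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k if k else 0.0

def hallucination_rate(claims: list[dict]) -> float:
    """Share of generated claims not supported by any cited snippet.
    Each claim dict is assumed to carry a boolean `supported` flag set by reviewers."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if not c["supported"]) / len(claims)

# Toy annotations (MVP goal: hallucination rate < 10%).
print(precision_at_k(["PMID:1", "PMID:2", "PMID:9"], {"PMID:1", "PMID:2"}, k=3))              # -> 0.666...
print(hallucination_rate([{"supported": True}, {"supported": True}, {"supported": False}]))   # -> 0.333...
```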
9. Suggested Timeline (MVP in 12 weeks — 2-week sprints)
Sprint 0 (1 week) — Kickoff
- Define MVP scope, KPIs, dataset access, basic infra.
Sprint 1 (2 weeks)
- Minimal ingestion pipeline: PubMed + NCBI Gene + UniProt (metadata + abstracts).
- Normalize IDs and store metadata.
Sprint 2 (2 weeks)
- Implement embeddings + vector DB; index initial data.
- Basic retrieval API.
Sprint 3 (2 weeks)
- Integrate LLM (API) and design RAG prompts. End-to-end test locally.
- Build basic frontend (chat).
Sprint 4 (2 weeks)
- Run evaluation with 50 test queries. Tune prompts/recall.
- Show citations + confidence scoring in UI.
Sprint 5 (2 weeks)
- Improve UI (basic pathway/entity visualization). Telemetry & feedback mechanism.
- Internal user tests (3–5 researchers); collect feedback.
Sprint 6 (1 week)
- Documentation, stakeholder demo, Phase 2 plan.
10. Minimum Recommended Team (MVP)
- 1 Project lead / data scientist (bioinformatics) — domain + evaluation.
- 1 ML engineer / MLOps — embeddings, vector DB, pipelines.
- 1 Backend dev (Python/FastAPI).
- 1 Frontend dev (React + visualizations).
- 1 Domain expert (molecular biology/PI) part-time for review and evaluation.
Phase 2+: ontology specialist, QA, DevOps, legal/ethics lead if clinical data is integrated.
11. Resource / Cost Estimate (indicative)
MVP 3 months (contracted staff / cloud costs):
- Personnel (5 roles, 3 months): US$40k–80k depending on local/remote rates.
- Infra + APIs (vector DB hosted, LLM API): US$2k–10k (highly variable by volume).
- Other (licenses, extra datasets): US$0–5k.
Indicative MVP total: US$45k–95k (lower if local salaries + open-source stack).
12. Risks and Mitigations
- LLM hallucinations → Mitigation: strict RAG grounding plus a fallback to returning only retrieved fragments; “No evidence available” is a valid output (see the sketch after this list).
- Outdated data → Mitigation: timestamped indexing, filter by publication date.
- Heterogeneous IDs → Mitigation: early normalization + mapping tables.
- Privacy/compliance (if clinical data is used) → Mitigation: encrypt PII, comply with HIPAA / GDPR-like regulations, agreements with data providers.
- Scalability → Mitigation: modular design, horizontally scalable vector DB.
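A minimal sketch of the hallucination fallback mentioned above; the 0.6 threshold is an illustrative assumption to be calibrated during the evaluation rounds.

```python
# Sketch of the low-confidence fallback: below a threshold the assistant returns only
# retrieved fragments, and with no evidence at all it returns an explicit message.
CONFIDENCE_THRESHOLD = 0.6   # illustrative value; calibrate against the benchmark
NO_EVIDENCE_MESSAGE = "There is not enough evidence in the indexed sources."

def safe_response(answer: str, confidence: float, snippets: list[dict]) -> dict:
    """Downgrade gracefully: full answer -> snippets only -> explicit 'no evidence'."""
    if not snippets:
        return {"mode": "no_evidence", "text": NO_EVIDENCE_MESSAGE, "citations": []}
    if confidence < CONFIDENCE_THRESHOLD:
        return {"mode": "snippets_only",
                "text": "Low confidence; showing retrieved evidence only.",
                "citations": snippets}
    return {"mode": "answer", "text": answer, "citations": snippets}

print(safe_response("TTN and LMNA are the most frequently implicated genes.", 0.45,
                    [{"source": "PubMed", "year": 2023, "url": "https://pubmed.ncbi.nlm.nih.gov/"}]))
```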
13. Ethics and Governance
- Display a visible “Limitations and appropriate use” disclaimer in the UI.
- Always log and cite sources.
- Keep logs for auditing and model improvement (with consent when private data is involved).
- Content review board for sensitive findings.
14. Expected Deliverables (MVP)
- Functional demo: web chat answering basic scientific queries with citations.
- Ingestion, indexing, retrieval code (repo).
- Set of 200 validation questions + results/analytics.
- Technical documentation + operations guide.
- Roadmap for Phase 2 with detailed budget.
15. Example Prompt (Template) — for RAG LLM
(system prompt / wrapper, not shown to user)
You are an expert assistant in molecular biology. You are provided with:
1) a user’s question,
2) N evidence snippets extracted from databases (each with source + year),
3) instructions.
Goal: respond briefly (max 250 words) with accuracy. Structure:
- Summary (1–2 sentences),
- Numbered list of findings / genes / entities with 1-line evidence + citation [Source, Year],
- Limitations / next steps (1 sentence).
If evidence does not support the claim, respond: "There is not enough evidence in the indexed sources" and provide what is available. Do not invent studies or attribute unsupported conclusions.
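At query time the template could be filled in by a small helper like the sketch below; the snippet fields (text, source, year) mirror the evidence format assumed in earlier sections, and the wording simply condenses the instructions above.

```python
# Sketch of assembling the RAG prompt from the template and retrieved evidence.
SYSTEM_TEMPLATE = """You are an expert assistant in molecular biology.
Answer in at most 250 words using ONLY the evidence below. Structure:
- Summary (1-2 sentences)
- Numbered findings, each with a citation [Source, Year]
- Limitations / next steps (1 sentence)
If the evidence is insufficient, reply: "There is not enough evidence in the indexed sources."
Do not invent studies or attribute unsupported conclusions."""

def build_rag_prompt(question: str, snippets: list[dict]) -> str:
    """Concatenate the system instructions, numbered evidence, and the user question."""
    evidence = "\n".join(
        f"[{i + 1}] ({s['source']}, {s['year']}) {s['text']}"
        for i, s in enumerate(snippets)
    )
    return f"{SYSTEM_TEMPLATE}\n\nEvidence:\n{evidence}\n\nQuestion: {question}"

print(build_rag_prompt("Which genes are associated with dilated cardiomyopathy?",
                       [{"source": "PubMed", "year": 2023,
                         "text": "TTN truncating variants are a major cause of dilated cardiomyopathy."}]))
```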
16. Scientific Evaluation Plan (Research)
- Publish paper/tech report comparing: baseline (BM25), RAG with embeddings, RAG+ontology on biomedical Q&A tasks.
- Metrics: precision@k, factuality (expert judgment), fluency (automatic), query response time.
- Evaluation dataset: co-developed with academics, 300 annotated questions with gold answers + sources.
17. Extensions / Research Roadmap
- Fine-tune LLM on biomedical literature to reduce errors.
- Add a reasoning module with evidence chains (controlled chain-of-thought), hidden from the user, surfacing only the resulting explanations.
- Include numerical analysis (e.g., run differential expression on GEO datasets on demand) via secure containerized pipelines.
- Multi-language support (Spanish for LATAM users).
18. Recommended Next Steps (Immediate Action)
- Approve MVP scope and assign minimum team.
- Set up access to APIs/datasets (NCBI/PubMed credentials, choose vector DB).
- Implement Sprint 0 + Sprint 1 (first 3–4 weeks) — ingestion + indexing pipeline.
- Prepare initial set of 50–100 real user questions for scientific evaluation.
