Hallucination-Aware Hybrid LLM System
Production-Grade RAG with Phi-3 and FAISS Retrieval
Production-Grade RAG with Phi-3 and FAISS Retrieval
Overview
Production-ready Retrieval-Augmented Generation (RAG) system that prevents hallucinations through strict context-grounding. Responses are retrieved from a FAISS vector index and verified against source documents before returning to users.
Features two deployment modes: Lightweight Mode (fast retrieval + template generation, 20-50ms) and Full Mode (Phi-3 LoRA with cross-encoder reranking, 100-500ms). Current operational status: 🟢 86.7% retrieval accuracy on test queries.
System Architecture
Visual System Architecture
Complete RAG pipeline from client request to response with hallucination prevention mechanisms

Data Flow Breakdown
Key Features
FAISS Vector Retrieval
Fast, accurate document similarity search on normalized embeddings with inner-product similarity for stable retrieval.
Dual Deployment Modes
Lightweight mode for fast template generation (20-50ms) or Full mode with Phi-3 reasoning (100-500ms) based on hardware availability.
Hallucination Guards
Multi-level safeguards: retrieval constraints + prompt engineering + token overlap verification to prevent unreliable outputs.
📊Production Features
Rate limiting, API key auth, structured logging with request IDs, Prometheus metrics, and comprehensive evaluation framework.
Hallucination Prevention Strategy
1️⃣ Retrieval Constraint
Query embedded using SentenceTransformer (all-MiniLM-L6-v2), top-K relevant documents retrieved via FAISS inner-product search. Only retrieved documents passed as context to LLM.
No parametric knowledge allowed during generation.
2️⃣ Prompt-Level Generation Constraints
Strict instructions in system prompt: Answer ONLY using provided context, do NOT use prior knowledge, if answer not in context respond exactly: "Not found in retrieved documents"
3️⃣ Token Overlap Verification
Generated answer verified against retrieved documents. If below threshold overlap, forces abstention. Citations track exact document spans used.
Evaluation Results
86.7%
Success Rate
13/15 correct retrievals
100%
Precision
Retrieved match keywords
20-50ms
Lightweight Mode
Fast retrieval
100-500ms
Full Mode
With Phi-3 generation