Hallucination-Aware Hybrid LLM System

Production-Grade RAG with Phi-3 and FAISS Retrieval

RAG · Phi-3 · FAISS · LoRA · FastAPI · SentenceTransformers · Cross-Encoder · Prometheus · Docker

Overview

Production-ready Retrieval-Augmented Generation (RAG) system that prevents hallucinations through strict context grounding. Context documents are retrieved from a FAISS vector index, and generated answers are verified against those source documents before being returned to users.

Features two deployment modes: Lightweight Mode (fast retrieval + template generation, 20-50ms) and Full Mode (Phi-3 LoRA with cross-encoder reranking, 100-500ms). Current operational status: 🟢 86.7% retrieval accuracy on test queries.

System Architecture

Visual System Architecture

[Figure: RAG System Architecture - end-to-end pipeline from client request to response, with hallucination prevention mechanisms]

Data Flow Breakdown

Client Request → FastAPI /query endpoint
↓ Middleware: Auth + Rate Limiting + Logging
↓ Async Inference Queue (bounded concurrency)
↓ RAG Pipeline:
├─ Load FAISS index (LRU cache)
├─ Embed query (SentenceTransformer, cached)
├─ Retrieve top-K documents (<1ms)
├─ Optional: Cross-encoder reranking
├─ Budget context to MAX_CONTEXT_CHARS
├─ [LIGHTWEIGHT] Template-based answer extraction
└─ [FULL] Phi-3 LoRA generation (100-500ms)
↓ Citation generation (span-level grounding)
↓ Hallucination guard (token overlap verification)
↓ JSON response with metrics to client
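The flow above can be sketched as a single orchestration function. This is an illustrative outline, not the actual codebase: `retrieve`, `generate`, and `verify` are hypothetical stand-ins for the FAISS search, the template/Phi-3 generator, and the token-overlap guard.

```python
from dataclasses import dataclass, field

@dataclass
class RAGResponse:
    answer: str
    citations: list = field(default_factory=list)
    grounded: bool = True

def answer_query(query, retrieve, generate, verify, max_context_chars=4000):
    """Orchestrate retrieve -> budget -> generate -> verify."""
    docs = retrieve(query)                    # top-K FAISS hits
    context, used = "", []
    for d in docs:                            # budget context to the char limit
        if len(context) + len(d) > max_context_chars:
            break
        context += d + "\n"
        used.append(d)
    answer = generate(query, context)         # template or Phi-3 LoRA generation
    if not verify(answer, used):              # token-overlap hallucination guard
        return RAGResponse("Not found in retrieved documents", [], False)
    return RAGResponse(answer, used, True)
```

The guard runs last on purpose: even a well-prompted generator can drift, so the abstention decision is made on the final answer, not on the prompt.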

Key Features

FAISS Vector Retrieval

Fast, accurate document similarity search on normalized embeddings with inner-product similarity for stable retrieval.
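FAISS's `IndexFlatIP` performs exact inner-product search, and on L2-normalized vectors the inner product equals cosine similarity. A NumPy sketch of the same math (faiss itself not required for this illustration):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    # L2-normalize rows so inner product == cosine similarity
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def top_k(doc_emb: np.ndarray, query_emb: np.ndarray, k: int = 3):
    # Exact inner-product search over all documents, as IndexFlatIP does
    scores = normalize(doc_emb) @ normalize(query_emb).T   # (n_docs, 1)
    idx = np.argsort(-scores[:, 0])[:k]
    return idx, scores[idx, 0]
```

With the real index, the equivalent steps are `faiss.normalize_L2(emb)` followed by `faiss.IndexFlatIP(dim).add(emb)` and `index.search(query, k)`.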

Dual Deployment Modes

Lightweight mode for fast template generation (20-50ms) or Full mode with Phi-3 reasoning (100-500ms) based on hardware availability.
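Mode selection based on hardware availability might look like the following sketch. The `RAG_MODE` environment variable and `choose_mode` function are assumptions for illustration, not the project's actual configuration surface:

```python
import importlib.util
import os

def choose_mode() -> str:
    """Pick 'full' only when a CUDA-capable torch install is present.

    RAG_MODE (hypothetical env var) overrides the hardware check.
    """
    forced = os.environ.get("RAG_MODE")
    if forced in ("lightweight", "full"):
        return forced
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "full"
    return "lightweight"  # safe default: 20-50ms template path
```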

Hallucination Guards

Multi-level safeguards: retrieval constraints + prompt engineering + token overlap verification to prevent unreliable outputs.

📊 Production Features

Rate limiting, API key auth, structured logging with request IDs, Prometheus metrics, and comprehensive evaluation framework.

Hallucination Prevention Strategy

1️⃣ Retrieval Constraint

The query is embedded with SentenceTransformer (all-MiniLM-L6-v2), and the top-K most relevant documents are retrieved via FAISS inner-product search. Only the retrieved documents are passed as context to the LLM.

No parametric knowledge allowed during generation.

2️⃣ Prompt-Level Generation Constraints

The system prompt enforces strict instructions: answer ONLY using the provided context; do NOT use prior knowledge; if the answer is not in the context, respond exactly: "Not found in retrieved documents".
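A prompt template encoding those three rules could look like this. The exact wording below is illustrative; only the abstention string "Not found in retrieved documents" is fixed by the guard that checks for it:

```python
PROMPT_TEMPLATE = """Answer ONLY using the context below.
Do NOT use prior knowledge.
If the answer is not in the context, respond exactly:
"Not found in retrieved documents"

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, docs: list[str]) -> str:
    # Join the retrieved (and budgeted) documents into a single context block
    return PROMPT_TEMPLATE.format(context="\n\n".join(docs), question=question)
```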

3️⃣ Token Overlap Verification

The generated answer is verified against the retrieved documents. If token overlap falls below a threshold, the system forces abstention. Citations track the exact document spans used.
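A minimal sketch of that check, measuring what fraction of the answer's tokens appear anywhere in the retrieved documents (the 0.6 threshold is an assumed value, not the project's configured one):

```python
import re

def token_overlap(answer: str, docs: list[str]) -> float:
    """Fraction of distinct answer tokens that appear in the retrieved docs."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    doc_tokens = set().union(*map(tokenize, docs)) if docs else set()
    return len(answer_tokens & doc_tokens) / len(answer_tokens)

def guard(answer: str, docs: list[str], threshold: float = 0.6) -> str:
    # Force abstention when the answer is not grounded in the retrieved text
    if token_overlap(answer, docs) < threshold:
        return "Not found in retrieved documents"
    return answer
```

Set-based overlap is deliberately cheap and order-insensitive; it catches answers that introduce entities or numbers absent from the corpus, at the cost of missing paraphrases that use different words.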

Evaluation Results

Success Rate: 86.7% (13/15 correct retrievals)
Precision: 100% (retrieved documents match expected keywords)
Lightweight Mode latency: 20-50ms (fast retrieval)
Full Mode latency: 100-500ms (with Phi-3 generation)

Live API Testing Examples

"What is the rate limit?"✅ 20-50ms
"What encryption is used?"✅ 20-50ms

Tech Stack

LLM & Retrieval

Phi-3 Mini · FAISS 1.8 · SentenceTransformer · CrossEncoder · LoRA (PEFT) · HuggingFace

Backend & API

FastAPI 0.111 · Uvicorn · Pydantic · asyncio · SlowAPI

Observability & Deployment

Prometheus · Docker · docker-compose · Streamlit · pytest

Implementation Highlights

Normalized Embeddings: Inner-product similarity instead of L2 distance for improved retrieval stability
Context Budgeting: 4000-character limit prevents overflowing the model's context window
LRU Caching: models, embeddings, and reranker scores held in 2048-entry LRU caches for performance
Async Inference Queue: Bounded concurrency with configurable worker threads for stable performance
Comprehensive Logging: Structured JSON logging with request IDs for debugging and monitoring
Docker-Ready: Separate API + UI services with docker-compose for easy deployment
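The LRU caching highlight maps directly onto Python's `functools.lru_cache`. A toy sketch, where `_embed` is a hypothetical stand-in for the real `SentenceTransformer.encode` call:

```python
from functools import lru_cache

def _embed(text: str) -> tuple:
    # Stand-in embedder for illustration; returns a hashable tuple so the
    # result can live in the cache (real embeddings would be numpy arrays)
    return tuple(ord(c) % 7 for c in text)

@lru_cache(maxsize=2048)  # matches the 2048-entry caches noted above
def embed_cached(text: str) -> tuple:
    return _embed(text)
```

Repeated queries then hit the cache instead of re-running the encoder, which is where much of the 20-50ms lightweight-mode latency budget is saved.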

Future Improvements

Adaptive similarity threshold for abstention
Fine-tuned domain-specific embedder
Explicit "not in corpus" detection
Confidence scoring per answer
Knowledge graph-based retrieval
Streaming response support