SentinelRAG is a production-grade Retrieval-Augmented Generation (RAG) system engineered from first principles to demonstrate how enterprise AI retrieval pipelines should operate: transparent, measurable, and grounded in real data rather than prompt tricks. The system uses a three-service microservices architecture (Python worker, Node.js API, and Next.js UI) and implements a full ingestion → chunking → indexing → hybrid retrieval → reranking → answer synthesis pipeline, deployed securely on GCP Cloud Run with PostgreSQL (pgvector), Redis, and Docker-based CI/CD.
Custom semantic chunking pipeline with token-aware overlap, paragraph preservation, metadata enrichment, and Redis-cached embedding generation.
Hybrid retrieval engine combining BM25-style keyword search over PostgreSQL full-text indexes (tsvector) with pgvector cosine similarity, merging and deduplicating candidates for high recall.
LLM reranking that batch-scores roughly 30 retrieved chunks and keeps the top 5–8 for precision, reducing hallucination risk by passing only the most relevant chunks to synthesis.
Strict answer synthesis that allows the LLM to answer only from retrieved context, with an explicit refusal mode and full source-chunk attribution.
Deep observability with per-stage latency budgets, hit-rate analytics, execution traces, token usage, and full retrieval diagnostics for debugging and optimization.
Fully containerized deployment using Docker and Cloud Build CI/CD, VPC-secured connections to PostgreSQL and Redis, and sub-200 ms retrieval latency under load.
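The chunking step listed above (token-aware overlap with paragraph preservation) can be illustrated with a minimal sketch. This is not the project's actual code: the function name `chunk_paragraphs`, its parameters, and the default sizes are hypothetical, and tokens are approximated by whitespace-separated words, whereas a real pipeline would presumably count tokens with the embedding model's tokenizer.

```python
from typing import List


def chunk_paragraphs(text: str, max_tokens: int = 200, overlap_tokens: int = 40) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_tokens "tokens".

    Assumption: a token is a whitespace-separated word. Each chunk after the
    first begins with the last overlap_tokens words of the previous chunk.
    """
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")

    # Largest unit that still fits next to a carried-over overlap.
    unit_size = max_tokens - overlap_tokens
    chunks: List[str] = []
    current: List[str] = []   # words in the chunk being built
    has_new = False           # True once current holds more than carried overlap

    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Paragraphs are kept whole when they fit; oversized ones are windowed.
        units = [words[i:i + unit_size] for i in range(0, len(words), unit_size)]
        for unit in units:
            if has_new and len(current) + len(unit) > max_tokens:
                chunks.append(" ".join(current))
                # Carry the tail of the finished chunk into the next one.
                current = current[-overlap_tokens:] if overlap_tokens else []
                has_new = False
            current.extend(unit)
            has_new = True

    if has_new:
        chunks.append(" ".join(current))
    return chunks
```

For example, `chunk_paragraphs("a b c\n\nd e f\n\ng h i", max_tokens=6, overlap_tokens=2)` packs the first two paragraphs into one chunk and starts the second chunk with the two-word overlap `e f`. Metadata enrichment and Redis-cached embedding generation are left out of this sketch.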
This project showcases real-world RAG engineering: explicit, inspectable, and built for production correctness rather than demo-level shortcuts.
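The merge, rerank-selection, and refusal steps described above can be sketched as follows. Everything here is illustrative rather than taken from the project: the `Candidate` type, the function names, the `min_score` threshold, and the refusal message are hypothetical, and the reranker is passed in as a plain scoring callable so the example runs without a database or an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

REFUSAL_MESSAGE = "No sufficiently relevant context was retrieved for this question."


@dataclass
class Candidate:
    chunk_id: str
    text: str
    keyword_score: Optional[float] = None  # set when the keyword search matched
    vector_score: Optional[float] = None   # set when the vector search matched


def merge_candidates(keyword_hits: List[Candidate], vector_hits: List[Candidate]) -> List[Candidate]:
    """Union both result lists, keeping one entry per chunk_id."""
    merged: Dict[str, Candidate] = {}
    for hit in keyword_hits + vector_hits:
        seen = merged.get(hit.chunk_id)
        if seen is None:
            # Copy so the caller's objects are not mutated later.
            merged[hit.chunk_id] = Candidate(
                hit.chunk_id, hit.text, hit.keyword_score, hit.vector_score
            )
            continue
        # Chunk found by both searches: keep whichever scores are present.
        if hit.keyword_score is not None:
            seen.keyword_score = hit.keyword_score
        if hit.vector_score is not None:
            seen.vector_score = hit.vector_score
    return list(merged.values())


def select_context(
    candidates: List[Candidate],
    score_fn: Callable[[Candidate], float],
    top_k: int = 5,
    min_score: float = 0.5,
) -> List[Candidate]:
    """Keep the top_k candidates whose rerank score is at least min_score."""
    scored = sorted(
        ((score_fn(c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [c for score, c in scored if score >= min_score][:top_k]


def prepare_answer_input(question: str, context: List[Candidate]) -> dict:
    """Refuse when no context survived; otherwise return context with sources."""
    if not context:
        return {"refused": True, "message": REFUSAL_MESSAGE, "sources": []}
    return {
        "refused": False,
        "question": question,
        "context": [c.text for c in context],
        "sources": [c.chunk_id for c in context],  # source-chunk attribution
    }
```

Deduplicating by chunk ID before reranking means a chunk found by both searches is scored only once, and returning the chunk IDs alongside the context keeps every answer traceable to its sources. How the actual system scores chunks and decides to refuse is not specified here.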