Real-Time Compliance Q&A for a Regulated Lender
Financial Services
- FastAPI
- pgvector
- Cohere Rerank
- AWS VPC
- OpenTelemetry
The Challenge
Compliance analysts relied on a monolithic prompt chain over ad-hoc PDF exports, producing 890ms p99 latency and hallucinated citations under peak load. Audit requirements demanded source-attributed answers with immutable retrieval logs, but the existing stack had no ACL-aware chunking or rerank stage—every query scanned the full document corpus.
The Architecture
We deployed a VPC-bound RAG pipeline: documents ingested via EventBridge triggers into S3, chunked with layout-aware parsers, embedded with text-embedding-3-large, and stored in RDS Postgres with pgvector HNSW indexes. Queries hit a FastAPI gateway that applies department-level ACL filters before hybrid retrieval (dense + BM25), then cross-encoder reranking via Cohere. Responses stream with citation spans linked to chunk UUIDs; all retrieval decisions append to an append-only audit table.
The Output
p99 query latency dropped from 890ms to 210ms at 450 concurrent users. Citation accuracy on the golden eval set reached 96.2%, up from 71%. Monthly inference spend fell 28% after semantic cache warm-up on the top 120 query templates.