Case Studies

Architectural breakdowns from production engagements.

Each study documents a real production pattern: the constraint we inherited, the system we designed, and the measurable outcome after launch. Client names are anonymized; infrastructure details reflect actual deployment topologies.

[fintech-rag]

Real-Time Compliance Q&A for a Regulated Lender

Financial Services

FastAPI
pgvector
Cohere Rerank
AWS VPC
OpenTelemetry

The Challenge

Compliance analysts relied on a monolithic prompt chain over ad-hoc PDF exports, which produced 890ms p99 latency and hallucinated citations under peak load. Audit requirements demanded source-attributed answers with immutable retrieval logs, but the existing stack had no ACL-aware chunking and no rerank stage, so every query scanned the full document corpus.

The Architecture

We deployed a VPC-bound RAG pipeline: documents are ingested via EventBridge triggers into S3, chunked with layout-aware parsers, embedded with text-embedding-3-large, and stored in RDS Postgres with pgvector HNSW indexes. Queries hit a FastAPI gateway that applies department-level ACL filters before hybrid retrieval (dense + BM25), followed by cross-encoder reranking via Cohere. Responses stream with citation spans linked to chunk UUIDs, and every retrieval decision is written to an append-only audit table.
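The ordering described above (ACL filter first, then hybrid retrieval) can be illustrated with a minimal sketch. It is not the deployed code. The `Chunk` fields, the function names, and the use of reciprocal rank fusion to merge the dense and BM25 rankings are all assumptions, since the case study does not say how the two rankings are combined. In a pgvector deployment the ACL check would more likely be a predicate inside the SQL query than a Python-side filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str       # stands in for the chunk UUID cited in responses
    department: str     # ACL tag assigned at ingestion (hypothetical field)


def acl_filter(chunks, allowed_departments):
    """Drop chunks the caller may not see before any ranking happens."""
    return [c for c in chunks if c.department in allowed_departments]


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) per chunk; k=60 is a common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_candidates(chunks, allowed_departments, dense_ranking, bm25_ranking):
    """Return fused candidate IDs restricted to chunks the caller may read."""
    visible = {c.chunk_id for c in acl_filter(chunks, allowed_departments)}
    dense = [cid for cid in dense_ranking if cid in visible]
    sparse = [cid for cid in bm25_ranking if cid in visible]
    return reciprocal_rank_fusion([dense, sparse])
```

The fused list would then be handed to the reranker. Because filtering happens before fusion, a chunk outside the caller's departments never reaches the rerank stage or the audit record as a candidate.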

The Output

p99 query latency dropped from 890ms to 210ms at 450 concurrent users. Citation accuracy on the golden eval set reached 96.2%, up from 71%. Monthly inference spend fell 28% after semantic cache warm-up on the top 120 query templates.

210ms p99 latency

96.2% citation accuracy

−28% cost reduction

[logistics-agents]

Multi-Agent Dispatch Reconciliation

Logistics and Supply Chain

Kafka
Temporal
Python
Redis
Grafana

The Challenge

Dispatch coordinators manually reconciled shipment exceptions across three legacy TMS exports, averaging 14 minutes per case with no audit trail. A prior chatbot demo failed in production because tool calls were unconstrained: agents issued warehouse API writes without idempotency keys, causing duplicate pallet moves during Kafka consumer lag spikes.

The Architecture

We introduced an event-driven orchestration layer: TMS webhooks publish to Kafka topics partitioned by region. A Temporal workflow engine coordinates planner, validator, and executor agents with explicit state transitions. The validator agent checks proposed actions against a rules DSL compiled from operations policy; only passing actions reach executor workers that call TMS APIs with idempotency keys stored in Redis. Human approval gates pause workflows exceeding $50k impact thresholds.
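The gating order in this design (validate, check the approval threshold, then deduplicate before writing) can be sketched as follows. This is an illustration under stated assumptions, not the production workflow: `Action`, `InMemoryKeyStore`, `execute`, and the allow-list that stands in for the rules DSL are hypothetical, and the in-memory store only mimics the effect of an atomic Redis `SET` with `NX`. Temporal workflow wiring is omitted.

```python
import hashlib
from dataclasses import dataclass

APPROVAL_THRESHOLD_USD = 50_000  # from the case study: gate above $50k impact


@dataclass(frozen=True)
class Action:
    workflow_id: str
    kind: str            # e.g. "move_pallet" (hypothetical action name)
    target: str
    impact_usd: float


def idempotency_key(action):
    """Derive a stable key so a retried action maps to the same write."""
    raw = f"{action.workflow_id}:{action.kind}:{action.target}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryKeyStore:
    """Stand-in for Redis; claim() mimics an atomic SET with NX."""

    def __init__(self):
        self._seen = set()

    def claim(self, key):
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def execute(action, store, allowed_kinds, tms_write):
    """Run one proposed action through validation, approval, and dedup."""
    if action.kind not in allowed_kinds:
        return "rejected"          # validator stage: rule check failed
    if action.impact_usd > APPROVAL_THRESHOLD_USD:
        return "needs_approval"    # a human gate would pause the workflow here
    if not store.claim(idempotency_key(action)):
        return "duplicate_skipped" # same action already written once
    tms_write(action)
    return "executed"
```

A replayed message during a consumer lag spike produces the same key and is skipped rather than written twice. A production version would also have to decide what happens when the write fails after the key has been claimed, which this sketch leaves open.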

The Output

The system processes 12,400 reconciliation workflows per day with 99.4% successful completion. Median case resolution time fell from 14 minutes to 2.3 minutes. Duplicate write incidents dropped to zero over a 90-day production window.

12.4k workflows/day

2.3 min median resolution

0 duplicate writes

[saas-inference]

Multi-Tenant LLM Gateway for a B2B SaaS Platform

Enterprise SaaS

Kubernetes
vLLM
Redis
Prometheus
Helm

The Challenge

The platform routed all tenant traffic through a single OpenAI proxy with shared API keys, yielding unpredictable token bills and 1.8s p99 latency during batch reporting windows. No per-tenant quotas existed; one customer’s fine-tune evaluation jobs starved interactive users. Leadership required sub-500ms p99 for chat features while cutting blended token cost by 35%.

The Architecture

We built a multi-tenant gateway on Kubernetes: vLLM serves a distilled 8B model on a dedicated GPU node pool with KEDA scaling on queue depth. A Redis semantic cache keys responses by normalized prompt hash plus tenant ID. The gateway enforces per-tenant RPM and token budgets, routes overflow to a higher-latency fallback provider, and exports RED metrics to Prometheus with burn-rate alerts on error budget consumption.
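Two of the gateway behaviors above, tenant-scoped cache keys and budget-based overflow routing, can be sketched in a few lines. The names (`cache_key`, `TokenBudget`, `route`), the normalization rules, and the fixed-window budget are assumptions made for illustration. The case study does not specify how prompts are normalized or how budget windows reset, and a plain dict stands in for Redis.

```python
import hashlib
import re


def normalize_prompt(prompt):
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def cache_key(tenant_id, prompt):
    """Scope the key by tenant so responses never cross tenant boundaries."""
    digest = hashlib.sha256(normalize_prompt(prompt).encode("utf-8")).hexdigest()
    return f"{tenant_id}:{digest}"


class TokenBudget:
    """Fixed-window token budget per tenant (window reset is left out)."""

    def __init__(self, limits):
        self._limits = dict(limits)
        self._used = {}

    def try_spend(self, tenant_id, tokens):
        used = self._used.get(tenant_id, 0)
        # Tenants without a configured limit get a budget of zero.
        if used + tokens > self._limits.get(tenant_id, 0):
            return False
        self._used[tenant_id] = used + tokens
        return True


def route(tenant_id, prompt, tokens, budget, cache):
    """Return (backend, key): cache hit, primary pool, or fallback provider."""
    key = cache_key(tenant_id, prompt)
    if key in cache:
        return "cache", key
    if budget.try_spend(tenant_id, tokens):
        return "primary", key
    return "fallback", key
```

Checking the cache before the budget means cache hits do not consume a tenant's token allowance, which is one plausible reading of how cache hits contribute to lower blended cost. Requests that exceed the budget are routed to the fallback rather than rejected, matching the overflow behavior described above.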

The Output

Interactive chat p99 latency stabilized at 420ms under peak load. Blended token cost per request decreased 41% via cache hits and model routing. Tenant fairness incidents, where one customer starved others, were eliminated after quota enforcement shipped.

420ms p99 latency

−41% token cost

99.97% uptime

Building something similar?

We scope engagements from architecture review through 0-to-1 production delivery. Share your constraints and we will map a phased plan.

Review Services
Schedule Architecture Call