Case Study20254 min read

Hybrid RAG Architecture for Multilingual Financial Document Intelligence

How a bilingual retrieval system collapsed a 2–3 day manual deal-research workflow into grounded answers in under two seconds.

Context: AgileCatalyst.ai · Stealth GCC Fintech ($200M fund)
Role: AI Product Design & Delivery
Scale: $200M fund · 100–200 docs/company · multi-tenant · English + Arabic
Timeframe: 2025–2026

MRR 1.0 retrieval (EN + AR) · 42ms retrieval · <1.2s end-to-end · 95%/90%+ OCR

GenAI
Infrastructure Economics
Rag
Multilingual Ai

"Isolation you can argue with in code is not isolation. Make it a property of the architecture."— AgileCatalyst.ai · Engineering Field Note

The Challenge

AgileCatalyst.ai serves a stealth GCC-based fintech startup running a dedicated $200M investment fund, where investment analysts evaluate private-company pitch decks and financial documents to drive capital deployment. The core workflow required analysts to manually read through unstructured PDFs — many of them in Arabic — to extract deal-relevant signals. That manual research and extraction phase took two to three days per deal and became a hard bottleneck as deal volume scaled — friction sitting directly on the critical path of a nine-figure capital base.

Three compounding problems defined the brief:

Document heterogeneity — scanned PDFs, digital PDFs, cap tables, financial models, and mixed-language decks, all in one knowledge base.
Arabic language complexity — right-to-left rendering and a morphologically rich script that breaks standard embedding models.
Multi-tenant isolation — a marketplace under strict client data segregation, where Client A must never be able to reach Client B's documents.

The Solution

We designed and deployed a production Hybrid RAG (Retrieval-Augmented Generation) system that lets analysts query their document knowledge base in natural language — English or Arabic — and receive grounded, sub-second answers. It was built ground-up with bilingual retrieval, asynchronous ingestion, and hard architectural isolation as first-class constraints rather than afterthoughts.

The production pipeline

Stage	Mechanism	Why it matters
01 · Ingestion	Upload → `/kb_train` endpoint → async background job → immediate ACK	Redesigned from sync to async after containers crashed on 50MB+ files. Unlocked production stability.
02 · OCR + Parsing	Gemini Pro (digital) + Tesseract (scanned) + PyMuPDF (structured) + Arabic Reshaper / BiDi	95% EN, 90%+ AR accuracy, with custom handling for cap tables and financial row/column structures.
03 · Chunking	Header-based + recursive + custom logic → ~500-token chunks	Custom chunker preserves table structure and prevents cross-row semantic bleed.
04 · Embeddings	paraphrase-multilingual-MiniLM-L12-v2 → Qdrant (company-scoped filter)	Benchmarked 20+ models. Previous model: MRR 0.12 on Arabic. This one: MRR 1.0.
05 · Hybrid Retrieval	Dense (vector) + Sparse (BM25) → Reciprocal Rank Fusion → Top-K	BM25 catches Arabic morphology dense embeddings miss. Combined: MRR 1.0 at 42ms.
06 · Generation	Top-K chunks → 3-prompt system → GPT-4o → grounded answer	FIFO memory (3 exchanges), faithfulness scorer, context-window guard, and cost-spike alerts.

Technology stack

Layer	Choice
OCR	Gemini Pro + Tesseract + PyMuPDF; Arabic Reshaper + BiDi
Chunking	Hybrid (header + recursive + custom), ~500 tokens
Embeddings	paraphrase-multilingual-MiniLM-L12-v2
Vector DB	Qdrant, company-scoped filter (hard isolation)
Retrieval	Hybrid BM25 + Dense, merged via Reciprocal Rank Fusion
LLM	GPT-4o, 3-prompt system, FIFO memory (3 exchanges)
Infra	FastAPI on AWS App Runner, Vite / AWS Amplify, MongoDB, Docker

Key Engineering Decisions

1 · Async ingestion. Synchronous ingestion crashed containers on files over 50MB. Moving to async with immediate acknowledgment and background processing resolved the instability and enabled large ingestion sessions — 10–20 files per batch, 100–200 documents per company knowledge base.

2 · Embedding-model selection. Arabic retrieval quality was the defining bottleneck. After benchmarking 20+ models, paraphrase-multilingual-MiniLM-L12-v2 reached MRR 1.0 on both English and Arabic test sets — up from MRR 0.12. This was the single highest-leverage technical decision in the project.

3 · Hybrid retrieval over dense-only. Dense vector search underperformed on Arabic queries because of morphological complexity. BM25 sparse retrieval captures the exact and near-exact keyword matches embeddings miss; Reciprocal Rank Fusion merges both result sets without a reranker, holding latency at 42ms while achieving MRR 1.0.

4 · Hard multi-tenant isolation. Client data isolation was non-negotiable. Company-scoped metadata filters are enforced as an architectural constraint at the Qdrant layer — not in application logic — so no cross-client document leakage is possible regardless of how a query is constructed.

Results and Impact

Metric	Outcome
Deal research & extraction	2–3 days → grounded answers in under 2 seconds
Retrieval quality (MRR)	1.0 on both English and Arabic — full bilingual parity
Retrieval latency	42ms
End-to-end response	under 1.2 seconds in production
OCR accuracy	95% English · 90%+ Arabic
Chunk relevancy	80% across retrieved context
Production stability	zero container crashes post async redesign
Client isolation	zero cross-tenant data-exposure incidents

Figures are normalized to US-market engineering costs and typical portfolio-company scale, drawn from the S+3 Agile field record.

← Back to the library