Hybrid RAG Architecture for Multilingual Financial Document Intelligence
How a bilingual retrieval system collapsed a 2–3 day manual deal-research workflow into grounded answers in under two seconds.
"Isolation you can argue with in code is not isolation. Make it a property of the architecture."— AgileCatalyst.ai · Engineering Field Note
The Challenge
AgileCatalyst.ai serves a stealth GCC-based fintech startup running a dedicated $200M investment fund, where investment analysts evaluate private-company pitch decks and financial documents to drive capital deployment. The core workflow required analysts to manually read through unstructured PDFs — many of them in Arabic — to extract deal-relevant signals. That manual research and extraction phase took two to three days per deal and became a hard bottleneck as deal volume scaled — friction sitting directly on the critical path of a nine-figure capital base.
Three compounding problems defined the brief:
- Document heterogeneity — scanned PDFs, digital PDFs, cap tables, financial models, and mixed-language decks, all in one knowledge base.
- Arabic language complexity — right-to-left rendering and a morphologically rich script that breaks standard embedding models.
- Multi-tenant isolation — a marketplace under strict client data segregation, where Client A must never be able to reach Client B's documents.
The Solution
We designed and deployed a production Hybrid RAG (Retrieval-Augmented Generation) system that lets analysts query their document knowledge base in natural language — English or Arabic — and receive grounded, sub-second answers. It was built ground-up with bilingual retrieval, asynchronous ingestion, and hard architectural isolation as first-class constraints rather than afterthoughts.
The production pipeline
| Stage | Mechanism | Why it matters |
|---|---|---|
| 01 · Ingestion | Upload → /kb_train endpoint → async background job → immediate ACK | Redesigned from sync to async after containers crashed on 50MB+ files. Unlocked production stability. |
| 02 · OCR + Parsing | Gemini Pro (digital) + Tesseract (scanned) + PyMuPDF (structured) + Arabic Reshaper / BiDi | 95% EN, 90%+ AR accuracy, with custom handling for cap tables and financial row/column structures. |
| 03 · Chunking | Header-based + recursive + custom logic → ~500-token chunks | Custom chunker preserves table structure and prevents cross-row semantic bleed. |
| 04 · Embeddings | paraphrase-multilingual-MiniLM-L12-v2 → Qdrant (company-scoped filter) | Benchmarked 20+ models. Previous model: MRR 0.12 on Arabic. This one: MRR 1.0. |
| 05 · Hybrid Retrieval | Dense (vector) + Sparse (BM25) → Reciprocal Rank Fusion → Top-K | BM25 catches Arabic morphology dense embeddings miss. Combined: MRR 1.0 at 42ms. |
| 06 · Generation | Top-K chunks → 3-prompt system → GPT-4o → grounded answer | FIFO memory (3 exchanges), faithfulness scorer, context-window guard, and cost-spike alerts. |
Technology stack
| Layer | Choice |
|---|---|
| OCR | Gemini Pro + Tesseract + PyMuPDF; Arabic Reshaper + BiDi |
| Chunking | Hybrid (header + recursive + custom), ~500 tokens |
| Embeddings | paraphrase-multilingual-MiniLM-L12-v2 |
| Vector DB | Qdrant, company-scoped filter (hard isolation) |
| Retrieval | Hybrid BM25 + Dense, merged via Reciprocal Rank Fusion |
| LLM | GPT-4o, 3-prompt system, FIFO memory (3 exchanges) |
| Infra | FastAPI on AWS App Runner, Vite / AWS Amplify, MongoDB, Docker |
Key Engineering Decisions
1 · Async ingestion. Synchronous ingestion crashed containers on files over 50MB. Moving to async with immediate acknowledgment and background processing resolved the instability and enabled large ingestion sessions — 10–20 files per batch, 100–200 documents per company knowledge base.
2 · Embedding-model selection. Arabic retrieval quality was the defining bottleneck. After benchmarking 20+ models, paraphrase-multilingual-MiniLM-L12-v2 reached MRR 1.0 on both English and Arabic test sets — up from MRR 0.12. This was the single highest-leverage technical decision in the project.
3 · Hybrid retrieval over dense-only. Dense vector search underperformed on Arabic queries because of morphological complexity. BM25 sparse retrieval captures the exact and near-exact keyword matches embeddings miss; Reciprocal Rank Fusion merges both result sets without a reranker, holding latency at 42ms while achieving MRR 1.0.
4 · Hard multi-tenant isolation. Client data isolation was non-negotiable. Company-scoped metadata filters are enforced as an architectural constraint at the Qdrant layer — not in application logic — so no cross-client document leakage is possible regardless of how a query is constructed.
Results and Impact
| Metric | Outcome |
|---|---|
| Deal research & extraction | 2–3 days → grounded answers in under 2 seconds |
| Retrieval quality (MRR) | 1.0 on both English and Arabic — full bilingual parity |
| Retrieval latency | 42ms |
| End-to-end response | under 1.2 seconds in production |
| OCR accuracy | 95% English · 90%+ Arabic |
| Chunk relevancy | 80% across retrieved context |
| Production stability | zero container crashes post async redesign |
| Client isolation | zero cross-tenant data-exposure incidents |
Figures are normalized to US-market engineering costs and typical portfolio-company scale, drawn from the S+3 Agile field record.