Get Onboard
Case Study20254 min read

Hybrid RAG Architecture for Multilingual Financial Document Intelligence

How a bilingual retrieval system collapsed a 2–3 day manual deal-research workflow into grounded answers in under two seconds.

Context
AgileCatalyst.ai · Stealth GCC Fintech ($200M fund)
Role
AI Product Design & Delivery
Scale
$200M fund · 100–200 docs/company · multi-tenant · English + Arabic
Timeframe
2025–2026
MRR 1.0 retrieval (EN + AR) · 42ms retrieval · <1.2s end-to-end · 95%/90%+ OCR
  • GenAI
  • Infrastructure Economics
  • Rag
  • Multilingual Ai
"Isolation you can argue with in code is not isolation. Make it a property of the architecture."— AgileCatalyst.ai · Engineering Field Note

The Challenge

AgileCatalyst.ai serves a stealth GCC-based fintech startup running a dedicated $200M investment fund, where investment analysts evaluate private-company pitch decks and financial documents to drive capital deployment. The core workflow required analysts to manually read through unstructured PDFs — many of them in Arabic — to extract deal-relevant signals. That manual research and extraction phase took two to three days per deal and became a hard bottleneck as deal volume scaled — friction sitting directly on the critical path of a nine-figure capital base.

Three compounding problems defined the brief:

  • Document heterogeneity — scanned PDFs, digital PDFs, cap tables, financial models, and mixed-language decks, all in one knowledge base.
  • Arabic language complexity — right-to-left rendering and a morphologically rich script that breaks standard embedding models.
  • Multi-tenant isolation — a marketplace under strict client data segregation, where Client A must never be able to reach Client B's documents.

The Solution

We designed and deployed a production Hybrid RAG (Retrieval-Augmented Generation) system that lets analysts query their document knowledge base in natural language — English or Arabic — and receive grounded, sub-second answers. It was built ground-up with bilingual retrieval, asynchronous ingestion, and hard architectural isolation as first-class constraints rather than afterthoughts.

The production pipeline

StageMechanismWhy it matters
01 · IngestionUpload → /kb_train endpoint → async background job → immediate ACKRedesigned from sync to async after containers crashed on 50MB+ files. Unlocked production stability.
02 · OCR + ParsingGemini Pro (digital) + Tesseract (scanned) + PyMuPDF (structured) + Arabic Reshaper / BiDi95% EN, 90%+ AR accuracy, with custom handling for cap tables and financial row/column structures.
03 · ChunkingHeader-based + recursive + custom logic → ~500-token chunksCustom chunker preserves table structure and prevents cross-row semantic bleed.
04 · Embeddingsparaphrase-multilingual-MiniLM-L12-v2 → Qdrant (company-scoped filter)Benchmarked 20+ models. Previous model: MRR 0.12 on Arabic. This one: MRR 1.0.
05 · Hybrid RetrievalDense (vector) + Sparse (BM25) → Reciprocal Rank Fusion → Top-KBM25 catches Arabic morphology dense embeddings miss. Combined: MRR 1.0 at 42ms.
06 · GenerationTop-K chunks → 3-prompt system → GPT-4o → grounded answerFIFO memory (3 exchanges), faithfulness scorer, context-window guard, and cost-spike alerts.

Technology stack

LayerChoice
OCRGemini Pro + Tesseract + PyMuPDF; Arabic Reshaper + BiDi
ChunkingHybrid (header + recursive + custom), ~500 tokens
Embeddingsparaphrase-multilingual-MiniLM-L12-v2
Vector DBQdrant, company-scoped filter (hard isolation)
RetrievalHybrid BM25 + Dense, merged via Reciprocal Rank Fusion
LLMGPT-4o, 3-prompt system, FIFO memory (3 exchanges)
InfraFastAPI on AWS App Runner, Vite / AWS Amplify, MongoDB, Docker

Key Engineering Decisions

1 · Async ingestion. Synchronous ingestion crashed containers on files over 50MB. Moving to async with immediate acknowledgment and background processing resolved the instability and enabled large ingestion sessions — 10–20 files per batch, 100–200 documents per company knowledge base.

2 · Embedding-model selection. Arabic retrieval quality was the defining bottleneck. After benchmarking 20+ models, paraphrase-multilingual-MiniLM-L12-v2 reached MRR 1.0 on both English and Arabic test sets — up from MRR 0.12. This was the single highest-leverage technical decision in the project.

3 · Hybrid retrieval over dense-only. Dense vector search underperformed on Arabic queries because of morphological complexity. BM25 sparse retrieval captures the exact and near-exact keyword matches embeddings miss; Reciprocal Rank Fusion merges both result sets without a reranker, holding latency at 42ms while achieving MRR 1.0.

4 · Hard multi-tenant isolation. Client data isolation was non-negotiable. Company-scoped metadata filters are enforced as an architectural constraint at the Qdrant layer — not in application logic — so no cross-client document leakage is possible regardless of how a query is constructed.

Results and Impact

MetricOutcome
Deal research & extraction2–3 days → grounded answers in under 2 seconds
Retrieval quality (MRR)1.0 on both English and Arabic — full bilingual parity
Retrieval latency42ms
End-to-end responseunder 1.2 seconds in production
OCR accuracy95% English · 90%+ Arabic
Chunk relevancy80% across retrieved context
Production stabilityzero container crashes post async redesign
Client isolationzero cross-tenant data-exposure incidents

Figures are normalized to US-market engineering costs and typical portfolio-company scale, drawn from the S+3 Agile field record.

← Back to the library