Retrieval-Augmented Generation has evolved from a simple retrieve-then-generate pattern into a sophisticated family of architectures, each optimized for different use cases and quality requirements. Understanding these patterns and their tradeoffs is essential for building production-grade AI systems that deliver accurate, grounded responses.
This research examines eight distinct RAG architectures, along with evaluation frameworks, chunking strategies, and vector database selection criteria, drawing on peer-reviewed research and production deployments.
Eight RAG Architecture Types
1. Simple RAG
The foundational retrieve-then-generate pipeline remains effective for many use cases:
Query → Embedding → Vector Search → Top-K Retrieval → Context Assembly → LLM Generation
Typical latency: 1-3 seconds. Simple RAG works well for FAQ systems and document Q&A where queries map directly to retrievable content.
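A minimal sketch of this pipeline is shown below; `embed`, `vector_store`, and `llm` are placeholders for whatever embedding model, vector index, and LLM client are in use, not any particular framework's API.

```python
# Simple RAG: embed the query, retrieve top-k chunks, assemble context,
# and generate. The three component arguments are stand-ins.
def simple_rag(query: str, embed, vector_store, llm, k: int = 5) -> str:
    query_vec = embed(query)                        # Query -> Embedding
    chunks = vector_store.search(query_vec, k=k)    # Vector Search -> Top-K Retrieval
    context = "\n\n".join(c.text for c in chunks)   # Context Assembly
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.generate(prompt)                     # LLM Generation
```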
2. Memory RAG
Memory RAG integrates conversation history for coherent multi-turn dialogue. Using patterns like ConversationBufferMemory, the system maintains context across interactions while still retrieving relevant documents for each query.
This architecture excels in customer support scenarios where understanding conversation context is essential for appropriate responses.
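A sketch of the pattern using a plain rolling buffer rather than any specific framework's memory class; the `embed`, `vector_store`, and `llm` placeholders follow the Simple RAG sketch above.

```python
# Memory RAG: keep recent turns in a buffer and fold them into both the
# retrieval query and the generation prompt, so follow-up questions
# ("what about pricing?") resolve against the right documents.
class ConversationMemory:
    def __init__(self, max_turns: int = 10):
        self.turns: list[tuple[str, str]] = []
        self.max_turns = max_turns

    def add(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))
        self.turns = self.turns[-self.max_turns:]

    def as_text(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)

def memory_rag(query, memory, embed, vector_store, llm, k=5):
    # Retrieve against the query enriched with recent history.
    retrieval_query = f"{memory.as_text()}\n{query}" if memory.turns else query
    chunks = vector_store.search(embed(retrieval_query), k=k)
    context = "\n\n".join(c.text for c in chunks)
    answer = llm.generate(
        f"Conversation so far:\n{memory.as_text()}\n\n"
        f"Context:\n{context}\n\nUser: {query}"
    )
    memory.add(query, answer)
    return answer
```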
3. Branching RAG
Branching RAG implements multiple retrieval paths with parallel execution. Queries route to specialized retrievers (Q&A, summarization, factual lookup) based on intent classification.
Performance benchmarks show 15-30% better recall than single-path approaches, though at the cost of increased infrastructure complexity.
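A sketch of the routing and fan-out step; `classify_intent` and the per-intent retriever callables are assumed interfaces, not a particular library.

```python
# Branching RAG: classify the query's intent(s), run the matching
# specialized retrievers in parallel, and merge the results.
from concurrent.futures import ThreadPoolExecutor

def branching_retrieve(query, classify_intent, retrievers: dict, k: int = 5):
    intents = classify_intent(query)                 # e.g. {"qa", "factual_lookup"}
    selected = [retrievers[i] for i in intents if i in retrievers]
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda r: r(query, k=k), selected))
    seen, merged = set(), []                         # de-duplicate across paths
    for batch in batches:
        for chunk in batch:
            if chunk.id not in seen:
                seen.add(chunk.id)
                merged.append(chunk)
    return merged
```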
4. Adaptive RAG
Adaptive RAG dynamically adjusts retrieval parameters based on query complexity. The system modifies k (number of retrieved documents), retrieval method (dense vs. sparse vs. hybrid), and re-ranking aggressiveness based on query characteristics.
This pattern proves particularly effective for heterogeneous corpora where different document types benefit from different retrieval strategies.
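One way to derive a retrieval plan from query characteristics is sketched below; the heuristics and thresholds are illustrative assumptions, not values from the text.

```python
# Adaptive RAG: choose k, retrieval mode, and re-ranking depth from a
# crude estimate of query complexity.
def retrieval_plan(query: str) -> dict:
    terms = query.split()
    keyword_like = any(t.isupper() or t.isdigit() for t in terms)
    if len(terms) <= 6 and keyword_like:     # short ID/keyword-style lookup
        return {"k": 5, "mode": "sparse", "rerank_top_n": 0}
    if len(terms) <= 15:                     # typical natural-language question
        return {"k": 10, "mode": "hybrid", "rerank_top_n": 5}
    return {"k": 25, "mode": "hybrid", "rerank_top_n": 10}  # long or multi-part query
```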
5. Corrective RAG (CRAG)
Yan et al. introduced CRAG in January 2024 (arXiv:2401.15884), implementing a self-correction mechanism with a lightweight retrieval evaluator that returns confidence scores.
Three-Tiered Action System
- Correct (high confidence): Proceed with retrieved documents
- Incorrect (low confidence): Trigger web search fallback
- Ambiguous (medium confidence): Combine internal and external sources
CRAG also applies a decompose-then-recompose algorithm to retrieved documents, splitting them into knowledge strips, filtering out irrelevant strips, and recombining the remainder before generation.
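The three-tiered action policy can be sketched as follows; `retrieve`, `evaluator`, and `web_search` are placeholders, and the confidence thresholds are illustrative rather than values from the paper.

```python
# Corrective RAG action policy: keep, replace, or augment retrieved
# documents based on the evaluator's confidence scores.
def corrective_retrieve(query, retrieve, evaluator, web_search,
                        high: float = 0.7, low: float = 0.3, k: int = 10):
    docs = retrieve(query, k=k)
    scores = [evaluator(query, d) for d in docs]   # confidence per document
    best = max(scores, default=0.0)
    if best >= high:                               # Correct: use retrieved documents
        return [d for d, s in zip(docs, scores) if s >= high]
    if best < low:                                 # Incorrect: web search fallback
        return web_search(query)
    internal = [d for d, s in zip(docs, scores) if s >= low]
    return internal + web_search(query)            # Ambiguous: combine both sources
```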
6. Self-RAG
Asai et al. introduced Self-RAG in October 2023 (arXiv:2310.11511), implementing self-reflection through special tokens that guide the generation process.
Reflection Tokens
- Retrieve: Should I retrieve for this segment?
- IsRel: Is retrieved content relevant?
- IsSup: Is generation supported by context?
- IsUse: Is generation useful for the query?
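In the paper these reflection tokens are emitted by the fine-tuned model itself during decoding; the sketch below models them as hypothetical predicate methods purely to show how they gate retrieval and candidate selection.

```python
# Self-RAG control flow (approximate): decide whether to retrieve, draft an
# answer per relevant passage, critique each draft, keep the best one.
def self_rag_segment(query, model, retrieve, k: int = 5) -> str:
    if not model.predict_retrieve(query):                  # [Retrieve]
        return model.generate(query)
    candidates = []
    for passage in retrieve(query, k=k):
        if not model.is_relevant(query, passage):          # [IsRel]
            continue
        draft = model.generate(query, context=passage)
        score = model.is_supported(draft, passage) + model.is_useful(query, draft)
        candidates.append((score, draft))                  # [IsSup] + [IsUse]
    if not candidates:
        return model.generate(query)
    return max(candidates)[1]
```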
Performance Results
| Metric | Self-RAG | Baseline |
|---|---|---|
| PopQA Accuracy | 55.8% | 14.7% (Llama2-13B) |
| Fact Verification | 81% | 71% (other techniques) |
| Biography Factuality | 80% | 71% (ChatGPT) |
7. Graph RAG
Microsoft introduced Graph RAG in April 2024 (arXiv:2404.16130), combining knowledge graph construction with community summarization using the Leiden algorithm.
Graph RAG excels at "What are the main themes?" queries where naive RAG fails. It's best suited for datasets in the 1M+ token range requiring holistic understanding across many documents.
The approach constructs entity-relationship graphs from source documents, then uses community detection to create hierarchical summaries that can answer global questions about the corpus.
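A sketch of the indexing side: `extract_triples` and `llm_summarize` are placeholder functions, and while the paper uses the Leiden algorithm for community detection, the example approximates it with networkx's greedy modularity communities to stay dependency-light.

```python
# Graph RAG indexing: build an entity-relationship graph from extracted
# triples, detect communities, and summarize each community. Global
# questions are then answered by map-reducing over community summaries.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def build_graph_index(documents, extract_triples, llm_summarize):
    graph = nx.Graph()
    for doc in documents:
        for subj, relation, obj in extract_triples(doc):
            graph.add_edge(subj, obj, relation=relation)
    communities = greedy_modularity_communities(graph)
    return [llm_summarize(graph.subgraph(nodes)) for nodes in communities]
```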
8. Agentic RAG
Agentic RAG represents the most sophisticated pattern, implementing agent-based retrieval with tool use, autonomous decision-making, and multi-hop reasoning across knowledge bases.
The agent decides when to retrieve, which sources to query, how to combine information, and when sufficient information has been gathered. This pattern enables complex research tasks that require synthesizing information from multiple sources.
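A sketch of the control loop; the `agent.decide`/`agent.answer` interface and the step budget are assumptions for illustration, not a specific framework's API.

```python
# Agentic RAG: the agent repeatedly chooses the next action (query a
# source, or answer) until it judges the gathered evidence sufficient.
def agentic_rag(question, agent, sources: dict, max_steps: int = 6):
    evidence = []
    for _ in range(max_steps):
        action, arg = agent.decide(question, evidence)
        if action == "answer":                        # enough information gathered
            break
        if action == "search" and arg["source"] in sources:
            results = sources[arg["source"]](arg["query"])
            evidence.extend(results)                  # results inform the next hop
    return agent.answer(question, evidence)
```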
RAGAS Evaluation Framework
RAGAS (arXiv:2309.15217) provides reference-free evaluation using LLM-based metrics, enabling automated quality assessment without ground truth labels.
Core Metrics
| Metric | Measures | Method |
|---|---|---|
| Faithfulness | Factual consistency with context | Claim extraction and verification |
| Answer Relevancy | Pertinence to query | Reverse-engineered question comparison |
| Context Precision | Signal-to-noise ratio | Relevant chunk ranking analysis |
| Context Recall | Retrieval completeness | Ground truth comparison (when available) |
Score Interpretation
- 0.8-1.0: Excellent — production ready
- 0.6-0.8: Good — may need optimization
- Below 0.4: Poor — requires substantial changes
RAGAS integrates with Langfuse, Datadog, and Evidently AI for production monitoring.
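A minimal evaluation run with the ragas library is sketched below; the imports and column names follow the 0.1.x releases and may differ in newer versions, and the sample row is purely illustrative.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One illustrative evaluation row; in practice these come from logged
# production traffic or a curated test set.
eval_data = Dataset.from_dict({
    "question": ["What does CRAG do when retrieval confidence is low?"],
    "answer": ["It falls back to web search."],
    "contexts": [["CRAG triggers a web search fallback on low confidence."]],
    "ground_truth": ["CRAG triggers web search when confidence is low."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)   # per-metric scores in the 0-1 ranges interpreted above
```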
Chunking Strategies
Chunking strategy significantly impacts retrieval quality. Benchmarks reveal substantial performance differences:
| Strategy | Accuracy | Recall | Cost |
|---|---|---|---|
| Page-level | 0.648 | High | Low |
| Semantic | +9% vs baseline | High | High |
| RecursiveCharacter (512 tokens) | Baseline | 85-90% | Low |
| LLM-based | Highest | Highest | Very High |
Best Practices
- General use: RecursiveCharacter with 400-512 tokens and 10-20% overlap (see the sketch after this list)
- High-value documents: Semantic chunking with embedding-based boundary detection
- Structured documents: Respect document structure (headers, sections) in chunking
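A sketch of the general-use recommendation using LangChain's splitter; the `from_tiktoken_encoder` constructor counts length in tokens, and the import path and exact signature may vary by version.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 512-token chunks with ~12% overlap, per the general-use recommendation.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,
    chunk_overlap=64,
)

text = "Retrieval quality depends heavily on how documents are chunked. " * 200
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks")
```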
Vector Database Comparison
| Database | Latency (p50) | Max Scale | Hybrid Search | Open Source |
|---|---|---|---|---|
| Pinecone | <10ms | Billions | Yes (API) | No |
| Weaviate | <50ms | Billions | Native | Yes |
| Qdrant | <10ms | Billions | Native | Yes |
| Chroma | ~20ms | Millions | Approximate | Yes |
Selection Framework
- Need managed + minimal ops → Pinecone
- Need hybrid search + flexibility → Weaviate or Qdrant
- Prototyping/lightweight → Chroma (see the example below)
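For the prototyping path, Chroma's in-memory client is enough to stand up a working index in a few lines; the collection name and sample documents below are illustrative, and the default embedding function is used.

```python
import chromadb

client = chromadb.Client()                           # in-memory, prototyping only
collection = client.create_collection(name="docs")
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "CRAG falls back to web search on low retrieval confidence.",
        "Self-RAG uses reflection tokens to critique its own generations.",
    ],
)
results = collection.query(
    query_texts=["What does CRAG do on low confidence?"],
    n_results=1,
)
print(results["documents"])
```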
Hybrid Search Best Practices
Hybrid search combines sparse (BM25) and dense (embedding) retrieval for improved recall. The recommended pipeline:
1. Parallel Retrieval: BM25 (top-K) and dense (top-K) run simultaneously
2. Fusion: Reciprocal Rank Fusion (RRF) with k=60
3. Re-ranking: Cross-encoder or ColBERT (optional but recommended)
4. Final Selection: Top-N passed to generation
Performance: 15-30% better recall than either method alone, with particularly strong gains on queries mixing keyword-oriented and semantic components.
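The fusion step is small enough to show in full; each input ranking is a list of document ids ordered by one retriever, and k=60 follows the recommendation above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document's fused score is the sum of 1/(k + rank) over the
    # rankings it appears in; higher is better.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["d3", "d1", "d7"],   # BM25 top-K
    ["d1", "d9", "d3"],   # dense top-K
])
# Optionally re-rank fused[:N] with a cross-encoder before generation.
```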
Advanced Techniques (2024-2025)
HyDE (Hypothetical Document Embeddings)
HyDE generates a hypothetical answer document first, then uses its embedding for retrieval. This approach significantly enhances retrieval precision for complex queries by bridging the query-document vocabulary gap.
query → LLM generates hypothetical answer → embed hypothetical → retrieve similar documents
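A sketch of this flow; `llm`, `embed`, and `vector_store` are the same kind of placeholders used in the earlier examples, and the generation prompt is an illustrative assumption.

```python
# HyDE: retrieve with the embedding of a hypothetical answer, which shares
# vocabulary with real answer documents, rather than the raw query.
def hyde_retrieve(query, llm, embed, vector_store, k: int = 5):
    hypothetical = llm.generate(
        f"Write a short passage that plausibly answers: {query}"
    )
    return vector_store.search(embed(hypothetical), k=k)
```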
Query Decomposition
Techniques including RQ-RAG, GMR, and RAG-Fusion decompose complex queries into simpler sub-queries before retrieval. Reported results show a 35% gain in document-level precision on multi-faceted questions (arXiv:2510.18633).
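A generic sketch of the decompose-retrieve-merge step; the decomposition prompt and the `retrieve` interface are illustrative assumptions rather than any one of the named methods.

```python
# Query decomposition: split a multi-faceted question into sub-queries,
# retrieve for each, and merge the results (rank fusion can replace the
# simple de-duplication below, as in RAG-Fusion).
def decomposed_retrieve(query, llm, retrieve, k: int = 5):
    subqueries = [
        line.strip()
        for line in llm.generate(
            f"Break this question into independent sub-questions, one per line:\n{query}"
        ).splitlines()
        if line.strip()
    ]
    seen, merged = set(), []
    for sq in subqueries:
        for doc in retrieve(sq, k=k):
            if doc.id not in seen:
                seen.add(doc.id)
                merged.append(doc)
    return merged
```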
No consensus exists on optimal RAG architecture — selection is highly dependent on use case, corpus size, and latency requirements. Start simple and add complexity only when metrics justify it.
Implementation Recommendations
- Start with Simple RAG to establish baselines before adding complexity
- Implement RAGAS evaluation early to measure improvement quantitatively
- Use hybrid search as a default for production systems
- Consider CRAG or Self-RAG when faithfulness metrics need improvement
- Deploy Graph RAG only for large corpora requiring holistic understanding
- Reserve Agentic RAG for complex multi-source research tasks
References
- Asai et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." arXiv:2310.11511
- Yan et al. (2024). "Corrective Retrieval Augmented Generation." arXiv:2401.15884
- Edge et al. (2024). "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv:2404.16130
- Es et al. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217
- Gao et al. (2023). "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv:2312.10997