Multi-agent LLM systems represent one of the most significant architectural paradigms in modern AI development. As organizations move beyond single-model deployments toward collaborative AI ecosystems, understanding the design patterns, protocols, and frameworks that enable effective agent orchestration becomes essential.
This research examines the foundational architectural patterns documented in recent surveys by Liu et al. (2024) and Chen et al. (2024), alongside the emerging protocol standards that promise to reshape how AI agents communicate and collaborate.
Core Architectural Patterns
Multi-agent LLM systems employ four primary architectural patterns, each presenting distinct tradeoffs between coordination efficiency, fault tolerance, and scalability.
Star Architecture (Centralized/Supervisor-Worker)
The star architecture places a central supervisor agent at the hub, coordinating all communication among worker agents. This pattern yields clear control flow and simplified debugging, but the supervisor can become a bottleneck at scale. Systems like AutoGen's supervisor architecture and LangGraph's supervisor tool-calling pattern implement this approach.
Qian et al. (2023), Hong et al. (2024), and Wu et al. (2023) have extensively documented this pattern's effectiveness for task decomposition scenarios where a clear hierarchy of responsibility enhances performance.
Star architectures excel when tasks can be cleanly decomposed and distributed, but require careful attention to the supervisor's context window limitations when managing many concurrent workers.
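A minimal, framework-agnostic sketch of the supervisor-worker loop may help make the pattern concrete. The worker roster and the decomposition heuristic below are illustrative assumptions, not drawn from any of the cited systems, and the worker's handler stands in for a role-scoped LLM call:

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    skill: str

    def handle(self, subtask: str) -> str:
        # Placeholder for an LLM call scoped to this worker's role.
        return f"[{self.name}] completed: {subtask}"

@dataclass
class Supervisor:
    workers: dict[str, Worker] = field(default_factory=dict)

    def register(self, worker: Worker) -> None:
        self.workers[worker.skill] = worker

    def decompose(self, task: str) -> list[tuple[str, str]]:
        # Illustrative decomposition: in practice the supervisor LLM
        # produces (skill, subtask) pairs from the task description.
        return [(skill, f"{task} ({skill} portion)") for skill in self.workers]

    def run(self, task: str) -> list[str]:
        # All communication flows through the supervisor (the hub).
        return [self.workers[skill].handle(subtask)
                for skill, subtask in self.decompose(task)]

supervisor = Supervisor()
supervisor.register(Worker("coder", "implementation"))
supervisor.register(Worker("tester", "testing"))
print(supervisor.run("add OAuth login"))
```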
Hierarchical/Tree Architecture
Hierarchical architectures introduce multi-level supervision where supervisors manage other supervisors. LangGraph's hierarchical teams and MegaAgent's system-level parallelism implement this pattern, balancing control and distribution while adding complexity in level management.
This pattern proves particularly effective for large-scale systems requiring departmental organization, such as enterprise software development pipelines where different teams handle frontend, backend, and testing responsibilities.
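The multi-level structure can be sketched as a tree in which any child of a supervisor is either a leaf worker or another supervisor. The department names and lambda workers below are invented for illustration:

```python
from typing import Callable, Union

# A node is either a leaf worker (a callable) or a supervisor over children.
Node = Union[Callable[[str], str], "SupervisorNode"]

class SupervisorNode:
    def __init__(self, name: str, children: dict[str, Node]):
        self.name = name
        self.children = children

    def run(self, task: str) -> list[str]:
        results: list[str] = []
        for role, child in self.children.items():
            subtask = f"{task} [{role}]"
            if isinstance(child, SupervisorNode):   # supervisor managing a supervisor
                results.extend(child.run(subtask))
            else:                                   # leaf worker
                results.append(child(subtask))
        return results

frontend = SupervisorNode("frontend", {
    "ui": lambda t: f"ui done: {t}",
    "tests": lambda t: f"tests done: {t}",
})
backend = SupervisorNode("backend", {"api": lambda t: f"api done: {t}"})
org = SupervisorNode("org", {"frontend": frontend, "backend": backend})
print(org.run("ship the settings page"))
```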
Network/Mesh Architecture (Peer-to-Peer)
Network architectures enable every agent to communicate with every other agent, maximizing information flow but creating communication overhead that scales quadratically with agent count. CAMEL's role-playing framework demonstrates this approach.
The primary advantage lies in resilience: single-agent failure doesn't cripple the entire system. This makes mesh architectures attractive for mission-critical applications where redundancy matters more than efficiency.
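A peer-to-peer topology can be sketched as a shared registry in which any agent may message any other; the broadcast helper below is an illustrative assumption and makes the quadratic traffic growth visible:

```python
class Peer:
    def __init__(self, name: str, registry: dict[str, "Peer"]):
        self.name = name
        self.inbox: list[tuple[str, str]] = []
        registry[name] = self

    def send(self, registry: dict[str, "Peer"], to: str, msg: str) -> None:
        registry[to].inbox.append((self.name, msg))

    def broadcast(self, registry: dict[str, "Peer"], msg: str) -> None:
        # O(n) messages per broadcast, hence O(n^2) traffic overall.
        for name, peer in registry.items():
            if name != self.name:
                peer.inbox.append((self.name, msg))

registry: dict[str, "Peer"] = {}
peers = [Peer(n, registry) for n in ("planner", "coder", "critic")]
peers[0].broadcast(registry, "proposed plan v1")
print(registry["critic"].inbox)  # still reachable even if another peer drops out
```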
Holonic/Hybrid Patterns
Holonic patterns combine elements from hierarchical and peer-to-peer approaches, creating systems optimized for specific performance metrics or highly dynamic environments. These patterns adapt their structure based on task requirements, switching between centralized coordination for complex planning and distributed execution for parallelizable subtasks.
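One way to read the switching behavior is: plan centrally, then execute independent subtasks in a distributed fashion. The sketch below is a simplified control-flow illustration (the planner, executor, and parallelism flag are assumptions, not a specific system's design):

```python
from concurrent.futures import ThreadPoolExecutor

def plan_centrally(task: str) -> list[str]:
    # A single coordinating agent produces the subtask list.
    return [f"{task} - part {i}" for i in range(1, 4)]

def execute(subtask: str) -> str:
    # Each subtask could be handled by an independent peer agent.
    return f"done: {subtask}"

def run(task: str, parallelizable: bool) -> list[str]:
    subtasks = plan_centrally(task)          # centralized coordination
    if parallelizable:
        with ThreadPoolExecutor() as pool:   # distributed execution
            return list(pool.map(execute, subtasks))
    return [execute(s) for s in subtasks]    # fall back to sequential

print(run("generate report", parallelizable=True))
```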
Model Context Protocol (MCP)
Anthropic released the Model Context Protocol in November 2024 as an open standard for connecting LLM applications to external data sources and tools. The protocol uses JSON-RPC 2.0 messages, drawing inspiration from the Language Server Protocol (LSP) that revolutionized IDE tooling.
Architecture Components
MCP defines three primary architectural components:
- Hosts: LLM applications initiating connections (e.g., Claude Desktop, IDE extensions)
- Clients: Connectors within host applications maintaining one-to-one sessions with servers
- Servers: Services providing context and capabilities to the LLM
Server Primitives
MCP servers expose three types of primitives:
- Resources: Context and data for users or AI models (documents, database records, API responses)
- Prompts: Templated messages and workflows for common operations
- Tools: Functions that AI models can execute to perform actions
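A minimal server sketch using the MCP Python SDK's FastMCP helper illustrates all three primitives. The decorator names follow the SDK as published in early 2025; the specific note-keeping resource, prompt, and tool are invented for illustration:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("notes-server")

@mcp.resource("notes://{note_id}")
def read_note(note_id: str) -> str:
    """Resource: context the host can pull into the model's window."""
    return f"Contents of note {note_id}"

@mcp.prompt()
def summarize_note(note_id: str) -> str:
    """Prompt: a reusable message template."""
    return f"Please summarize the note with id {note_id}."

@mcp.tool()
def append_to_note(note_id: str, text: str) -> str:
    """Tool: an action the model can invoke."""
    return f"Appended {len(text)} characters to note {note_id}"

if __name__ == "__main__":
    mcp.run()  # defaults to the STDIO transport for local hosts
```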
Transport Mechanisms
MCP supports two transport mechanisms:
- STDIO: Standard input/output for local integrations with minimal latency
- HTTP+SSE: Server-Sent Events over HTTP for remote connections with streaming support
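Regardless of transport, every message is a JSON-RPC 2.0 envelope. A tools/call request, shown here as a Python dict for brevity, looks roughly like the following (the tool name and arguments are placeholders matching the server sketch above):

```python
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "append_to_note",          # tool exposed by the server
        "arguments": {"note_id": "42", "text": "hello"},
    },
}
print(json.dumps(request))  # sent unchanged over STDIO or HTTP+SSE
```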
Adoption and Ecosystem
By January 2025, MCP had achieved significant adoption with over 1,000 open-source connectors available. Major adopters include OpenAI, Google DeepMind, Block, Apollo, Zed, Replit, Codeium, and Sourcegraph. Official SDKs are available in Python, TypeScript, C#, and Java.
An April 2025 security analysis identified several concerns: prompt injection vulnerabilities, tool permission issues allowing file exfiltration, lookalike tools silently replacing trusted ones, and OAuth authorization specification conflicts with enterprise practices. Production deployments should implement strict tool validation and permission scoping.
Agent2Agent (A2A) Protocol
Google announced A2A on April 9, 2025, at Google Cloud Next, subsequently donating it to the Linux Foundation on June 23, 2025. Over 100 companies support A2A, including AWS, Cisco, Microsoft, Salesforce, SAP, and ServiceNow.
Core Capabilities
| Capability | Description |
|---|---|
| Capability Discovery | Agents advertise capabilities via JSON "Agent Cards" |
| Task Management | Task objects with lifecycle support (immediate or long-running) |
| Collaboration | Agents exchange context, replies, artifacts, user instructions |
| UX Negotiation | Messages include "parts" with specified content types |
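An Agent Card is the small JSON document an agent publishes so peers can discover its capabilities. The sketch below approximates the published schema; field names, values, and the well-known path should be checked against the current A2A specification:

```python
agent_card = {
    "name": "invoice-processor",
    "description": "Extracts and validates fields from invoices.",
    "url": "https://agents.example.com/invoice",   # A2A endpoint
    "version": "1.0.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "defaultInputModes": ["text/plain", "application/pdf"],
    "defaultOutputModes": ["application/json"],
    "skills": [
        {
            "id": "extract-fields",
            "name": "Extract invoice fields",
            "description": "Returns totals, dates, and line items.",
        }
    ],
}
# Typically served at /.well-known/agent.json for capability discovery.
```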
MCP vs A2A: Complementary Protocols
The key differentiator between MCP and A2A lies in their scope: MCP connects agents with tools and data, while A2A enables agents to collaborate with one another as peers rather than being invoked merely as tools. The protocols are designed to complement each other.
"Use MCP for tools and A2A for agents." — Protocol design guidance from the A2A specification
Memory Systems for Agents
Effective memory management remains one of the most challenging aspects of multi-agent system design. Three approaches have emerged as particularly influential.
MemGPT
Packer et al. introduced MemGPT at NeurIPS 2023, presenting an OS-inspired two-tier memory system:
- Tier 1 (Main Context/RAM): In-context core memories within the LLM's context window
- Tier 2 (External Context/Disk): Recall storage and archival storage for long-term persistence
The agent uses function calls to manage context window contents, reading from and writing to external data sources as needed. This approach enables virtually unlimited conversation history while maintaining relevant context.
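The core mechanism can be sketched as a paging policy: keep recent messages in the in-context tier, evict the oldest to archival storage when a token budget is exceeded, and page results back in via search. The budget, tokenizer stand-in, and substring search below are simplifying assumptions, not MemGPT's actual implementation:

```python
from collections import deque

class TwoTierMemory:
    def __init__(self, context_budget_tokens: int = 2000):
        self.budget = context_budget_tokens
        self.main_context: deque[str] = deque()   # Tier 1: in-context
        self.archival: list[str] = []             # Tier 2: external store

    def _tokens(self, text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    def append(self, message: str) -> None:
        self.main_context.append(message)
        # Page out oldest messages once the in-context tier overflows.
        while sum(map(self._tokens, self.main_context)) > self.budget:
            self.archival.append(self.main_context.popleft())

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Function-call-style retrieval from the external tier.
        hits = [m for m in self.archival if query.lower() in m.lower()]
        return hits[-k:]

mem = TwoTierMemory(context_budget_tokens=10)
for turn in ["user: my name is Ada", "assistant: hi Ada",
             "user: let's talk about protocols"]:
    mem.append(turn)
print(mem.recall("name"))
```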
A-MEM
Xu et al. (NeurIPS 2025) introduced A-MEM, which uses the Zettelkasten method for interconnected knowledge networks with dynamic indexing and linking. Performance benchmarks demonstrate impressive efficiency:
- 85-93% reduction in token usage versus baselines
- Approximately 1,200 tokens per memory operation
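The Zettelkasten idea can be sketched as notes that are automatically linked to related existing notes at insertion time. The keyword-overlap linking rule below is a deliberate simplification standing in for A-MEM's LLM-driven indexing and note construction:

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Note:
    id: int
    text: str
    keywords: set[str]
    links: set[int] = field(default_factory=set)

class ZettelMemory:
    def __init__(self):
        self.notes: dict[int, Note] = {}
        self._ids = count(1)

    def add(self, text: str) -> Note:
        note = Note(next(self._ids), text, set(text.lower().split()))
        # Dynamic linking: connect to notes sharing enough keywords.
        for other in self.notes.values():
            if len(note.keywords & other.keywords) >= 2:
                note.links.add(other.id)
                other.links.add(note.id)
        self.notes[note.id] = note
        return note

memory = ZettelMemory()
memory.add("A2A handles agent to agent collaboration")
n = memory.add("MCP handles agent to tool integration")
print(n.links)  # linked through the shared keywords
```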
Mem0
Mem0 (2025) dynamically extracts, consolidates, and retrieves salient information using graph-based memory representations with Neo4j. Benchmarks show:
- 26% relative improvement in LLM-as-Judge metrics over the OpenAI memory baseline
- 91% lower p95 latency
- >90% token cost savings
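The extract-consolidate-retrieve loop can be sketched without Neo4j as a plain in-memory graph of (subject, relation, object) triples. The regex extractor below is an illustrative stand-in for Mem0's LLM-based extraction, and none of this reflects the mem0 library's actual API:

```python
import re
from collections import defaultdict

class GraphMemory:
    def __init__(self):
        # adjacency: subject -> set of (relation, object) pairs
        self.graph: defaultdict[str, set[tuple[str, str]]] = defaultdict(set)

    def extract(self, utterance: str) -> list[tuple[str, str, str]]:
        # Toy extractor for "X likes Y" / "X works at Y" style facts.
        m = re.match(r"(\w+) (likes|works at) ([\w ]+)", utterance)
        return [(m.group(1), m.group(2), m.group(3))] if m else []

    def consolidate(self, utterance: str) -> None:
        # Adding to a set de-duplicates repeated or restated facts.
        for subj, rel, obj in self.extract(utterance):
            self.graph[subj].add((rel, obj))

    def retrieve(self, subject: str) -> set[tuple[str, str]]:
        return self.graph.get(subject, set())

mem = GraphMemory()
mem.consolidate("Ada likes espresso")
mem.consolidate("Ada works at Initech")
mem.consolidate("Ada likes espresso")     # consolidated, not duplicated
print(mem.retrieve("Ada"))
```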
Agent Frameworks Comparison
Three frameworks have emerged as leading options for production multi-agent systems, each with distinct architectural philosophies.
| Framework | Architecture | Strengths | Best For |
|---|---|---|---|
| LangGraph | Graph-based state machine | Fine-grained control, cycles, streaming | Complex workflows, production systems |
| CrewAI | Role-based crews | Intuitive team structure, rapid development | Business process automation |
| AutoGen | Conversation-centric | Flexible dialogue patterns, code execution | Research, iterative problem-solving |
LangGraph Deep Dive
LangGraph distinguishes itself through cyclic graph support, enabling iterative refinement workflows impossible in DAG-based systems. Key features include:
- Checkpointing for conversation continuity across sessions
- Human-in-the-loop via the `interrupt()` primitive
- Swarm-style multi-agent handoffs via `langgraph-swarm`
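A compact sketch of a cyclic refinement loop with checkpointing follows, assuming the langgraph Python API (StateGraph, conditional edges, MemorySaver) roughly as of its 2024-2025 releases; the revise node and stopping rule are placeholders for real LLM calls:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class DraftState(TypedDict):
    draft: str
    revisions: int

def revise(state: DraftState) -> dict:
    # Placeholder for an LLM call that improves the draft.
    return {"draft": state["draft"] + " (revised)",
            "revisions": state["revisions"] + 1}

def should_continue(state: DraftState) -> str:
    # The cycle: loop back to "revise" until a stopping condition holds.
    return "revise" if state["revisions"] < 3 else END

builder = StateGraph(DraftState)
builder.add_node("revise", revise)
builder.add_edge(START, "revise")
builder.add_conditional_edges("revise", should_continue)

# Checkpointing: the thread_id lets a later invocation resume this state.
graph = builder.compile(checkpointer=MemorySaver())
result = graph.invoke({"draft": "outline", "revisions": 0},
                      config={"configurable": {"thread_id": "session-1"}})
print(result["revisions"])  # 3
```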
CrewAI Production Metrics
CrewAI uses a Role-Goal-Backstory Framework with YAML configuration for rapid crew definition. Production deployments have demonstrated significant results:
- PwC boosted code-generation accuracy from 10% to 70%
- Approximately 90% reduction in processing time for back-office automation
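To make the Role-Goal-Backstory pattern concrete, here is a minimal crew definition assuming the crewai Python API (Agent, Task, Crew, kickoff) and an LLM key configured via environment variables; the roles and tasks are invented for illustration:

```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Summarize recent findings on multi-agent systems",
    backstory="A meticulous analyst who cites sources.",
)
writer = Agent(
    role="Technical Writer",
    goal="Turn research notes into a concise brief",
    backstory="A writer who favors clear, plain language.",
)
research = Task(
    description="Collect three key findings on agent orchestration.",
    expected_output="A bulleted list of findings.",
    agent=researcher,
)
brief = Task(
    description="Write a one-paragraph brief from the findings.",
    expected_output="A single paragraph.",
    agent=writer,
)
crew = Crew(agents=[researcher, writer], tasks=[research, brief])
result = crew.kickoff()
```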
AutoGen v0.4 Architecture
Microsoft's AutoGen (v0.4) provides a layered API structure:
- Core API: Event-driven asynchronous messaging
- AgentChat API: Rapid prototyping with pre-built agent types
- Extensions API: LLM clients, tools, and code execution environments
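A small AgentChat-layer sketch, assuming the v0.4 package layout (autogen_agentchat, autogen_ext) and an OpenAI key in the environment; the two-agent round-robin team and termination phrase are illustrative choices, not a prescribed pattern:

```python
import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import TextMentionTermination
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main() -> None:
    model_client = OpenAIChatCompletionClient(model="gpt-4o-mini")
    coder = AssistantAgent("coder", model_client=model_client,
                           system_message="Write the code, then say DONE.")
    reviewer = AssistantAgent("reviewer", model_client=model_client,
                              system_message="Review the code briefly.")
    # Agents take turns until the termination phrase appears.
    team = RoundRobinGroupChat([coder, reviewer],
                               termination_condition=TextMentionTermination("DONE"))
    result = await team.run(task="Write a function that reverses a string.")
    print(result.messages[-1].content)

asyncio.run(main())
```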
Coordination and Failure Handling
The MAST Taxonomy (Cemri et al., 2025) provides the most comprehensive analysis of multi-agent system failure modes, identifying 14 unique failure modes across 3 categories:
Failure Categories
- Specification Issues: Poor task decomposition, inadequate role definition
- Inter-Agent Misalignment: Communication breakdowns, memory management failures
- Task Verification Problems: Incorrect output verification (13.48% of observed failures)
With the same underlying model, changes to the multi-agent workflow and prompts achieved maximum improvements of 15.6%. This suggests that architectural decisions matter significantly, even when model capabilities are held constant.
Production Benchmarks
AgentBench
AgentBench (Liu et al., ICLR 2024) established the first comprehensive LLM-as-Agent benchmark across 8 environments. The study identified the main failure reasons as:
- Poor long-term reasoning
- Suboptimal decision-making under uncertainty
- Instruction following degradation in extended interactions
MultiAgentBench
MultiAgentBench (Zhu et al., March 2025) provides updated benchmarking across contemporary models:
- GPT-4o-mini achieves highest average task score
- Graph structure performs best in research scenarios
- Cognitive planning improves milestone achievement by 3%
Implementation Recommendations
Based on the research surveyed, we offer the following recommendations for production multi-agent system deployment:
- Start with star architecture for initial deployments, scaling to hierarchical patterns as complexity grows
- Implement MCP for tool integration and prepare for A2A adoption as the protocol matures
- Choose memory systems based on use case: MemGPT for conversational agents, A-MEM for knowledge-intensive tasks, Mem0 for graph-structured domains
- Use LangGraph for production systems requiring fine-grained control; CrewAI for rapid business automation prototyping
- Budget 15-20% of development time for failure handling and recovery mechanisms (see the sketch after this list)
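As a concrete starting point for the last recommendation, a generic wrapper can add timeouts, retries, and output verification around any agent call. The verification predicate, retry count, and timeout below are assumptions to be tuned per deployment:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Callable

def call_with_recovery(agent_call: Callable[[str], str], task: str,
                       verify: Callable[[str], bool],
                       retries: int = 2, timeout_s: float = 30.0) -> str:
    """Run agent_call(task), retrying on timeout or failed verification."""
    pool = ThreadPoolExecutor(max_workers=retries + 1)
    last_error = "no attempts made"
    try:
        for attempt in range(retries + 1):
            try:
                output = pool.submit(agent_call, task).result(timeout=timeout_s)
            except TimeoutError:
                last_error = f"timeout on attempt {attempt + 1}"
                continue
            if verify(output):          # guards against verification failures
                return output
            last_error = f"verification failed on attempt {attempt + 1}"
            time.sleep(2 ** attempt)    # simple exponential backoff
    finally:
        pool.shutdown(wait=False)
    raise RuntimeError(f"agent call failed: {last_error}")

# Usage: wrap any framework's invoke / kickoff / run entry point.
result = call_with_recovery(lambda t: f"answer to {t}", "summarize report",
                            verify=lambda out: len(out) > 0)
print(result)
```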
References
- Wu et al. (2023). "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." arXiv:2308.08155
- Packer et al. (2023). "MemGPT: Towards LLMs as Operating Systems." NeurIPS 2023
- Xu et al. (2025). "A-MEM: Agentic Memory for LLM Agents." NeurIPS 2025
- Liu et al. (2024). "Multi-Agent Systems Survey." arXiv
- Chen et al. (2024). "LLM Agent Architectures." arXiv
- Cemri et al. (2025). "MAST: Multi-Agent System Taxonomy."
- Zhu et al. (2025). "MultiAgentBench." arXiv