The gap between training and deploying large language models has created intense focus on inference optimization. While model capabilities scale with parameters, so do computational requirements — making efficient inference essential for production viability. This research examines the full stack of optimization techniques, from weight quantization to speculative decoding.
Quantization Methods
Quantization reduces model size and accelerates inference by representing weights with fewer bits. Three methods dominate production deployments, each with distinct characteristics.
GPTQ
GPTQ implements post-training quantization using Hessian-based optimization. The algorithm sequentially quantizes weights while compensating for quantization error using second-order information.
Key characteristics:
- Excels at GPU inference; best suited to dedicated-GPU deployments
- ~5× faster than GGUF when using Marlin kernels
- Requires a calibration dataset for optimal results
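To make the mechanics concrete, the sketch below shows plain round-to-nearest 4-bit quantization with per-group scales — only the storage format that GPTQ targets, not GPTQ itself, which additionally uses Hessian-based error compensation while rounding. The shapes and the group size of 128 are illustrative assumptions.

```python
import torch

def quantize_rtn(weight: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Round-to-nearest quantization with per-group symmetric scales.

    Illustrates the 4-bit storage format only; GPTQ adds second-order
    (Hessian) error compensation on top of this. Assumes in_features is
    divisible by group_size.
    """
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)

    qmax = 2 ** (bits - 1) - 1                                   # 7 for signed 4-bit
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)     # integer codes

    dequant = (q * scale).reshape(out_features, in_features)     # what inference sees
    return q.to(torch.int8), scale, dequant

w = torch.randn(4096, 4096)
q, scale, w_hat = quantize_rtn(w)
print("mean abs rounding error:", (w - w_hat).abs().mean().item())
```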
AWQ (Activation-aware Weight Quantization)
Developed by the MIT HAN Lab, AWQ protects salient weights by observing activation distributions. Rather than treating all weights equally, it identifies and preserves the weights that most strongly influence activations.
Performance characteristics:
- ~95% quality retention (vs GPTQ's ~90%)
- Particularly effective for instruction-tuned models
- Integrated with TensorRT-LLM, vLLM, HuggingFace TGI
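The sketch below illustrates the core AWQ intuition — per-input-channel scaling driven by activation magnitudes — in heavily simplified form. The real method searches the scaling exponent per layer and fuses the scales into adjacent operators; the `alpha = 0.5` exponent and all shapes here are assumptions for illustration.

```python
import torch

def awq_style_scale(weight, act_samples, alpha=0.5):
    """Per-input-channel scaling in the spirit of AWQ (simplified).

    Channels with large average activation magnitude get their weights
    scaled up before quantization (and activations scaled down by the same
    factor), shrinking rounding error on the "salient" channels.
    """
    act_scale = act_samples.abs().mean(dim=0)        # [in_features], from calibration data
    s = act_scale.clamp(min=1e-5) ** alpha
    s = s / s.mean()                                  # keep overall magnitude stable

    scaled_weight = weight * s                        # quantize this instead of `weight`
    return scaled_weight, s                           # at runtime: x -> x / s

w = torch.randn(4096, 1024)      # [out_features, in_features]
x = torch.randn(256, 1024)       # calibration activations
w_scaled, s = awq_style_scale(w, x)

# The transformation is mathematically a no-op before quantization:
print(torch.allclose((x / s) @ w_scaled.T, x @ w.T, atol=1e-3))
```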
GGUF
Native to llama.cpp, GGUF is the format of choice for CPU and Apple Silicon inference. It supports mixed quantization levels across layers.
Characteristics:
- ~92% quality retention
- Quantizes in minutes (vs hours for GPTQ/AWQ)
- Excellent for CPU and Apple Silicon deployment
- Supports dynamic quantization strategies
At 4-bit, AWQ and GPTQ models process roughly 3× more requests per second than unquantized BF16 baselines at comparable quality levels.
FlashAttention Evolution
The FlashAttention series has revolutionized attention computation through IO-aware algorithm design. Understanding this evolution is essential for optimal deployment.
FlashAttention (Dao et al., NeurIPS 2022)
The original FlashAttention introduced IO-aware attention computation, accounting for reads/writes between GPU HBM (high bandwidth memory) and on-chip SRAM.
Results:
- 2-4× speedup, up to 7.6× on GPT-2
- 10× memory savings at 2K sequence length
- 20× memory savings at 4K sequence length
FlashAttention-2 (2023)
FlashAttention-2 achieved 2× speedup over FA-1 through improved parallelism and reduced non-matmul FLOPs. Performance reaches 50-73% of theoretical maximum FLOPs/s on A100, up to 225 TFLOPs/s.
FlashAttention-3 (2024)
FlashAttention-3 introduced Hopper architecture optimizations with FP8 Tensor Core support, nearly doubling TFLOPs/s compared to FA-2 on H100 GPUs.
FlashAttention-4 (2025)
FlashAttention-4 targets the Blackwell architecture, achieving approximately 20% speedup over cuDNN kernels on B200 GPUs.
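Regardless of generation, most deployments enable FlashAttention through framework dispatch rather than by calling kernels directly. The snippet below is a minimal sketch of the explicit opt-in path via PyTorch's scaled_dot_product_attention, assuming PyTorch 2.3+ and a CUDA GPU with a FlashAttention backend available.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# [batch, heads, seq_len, head_dim] in half precision on a CUDA device.
q = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)

# Restrict dispatch to the FlashAttention backend so a slower fallback
# is not silently used; without the context manager PyTorch picks
# an appropriate fused kernel automatically.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([1, 32, 4096, 128])
```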
PagedAttention and vLLM
Kwon et al. introduced PagedAttention at SOSP 2023 (arXiv:2309.06180), applying OS-inspired virtual memory paging to KV cache management.
The Problem
Existing inference systems waste 60-80% of KV cache memory due to fragmentation and over-allocation. This severely limits batch sizes and throughput.
The Solution
PagedAttention divides the KV cache into fixed-size blocks that can be non-contiguously allocated, similar to virtual memory pages. This achieves <4% memory waste.
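A toy block allocator makes the idea concrete. This is a sketch of the bookkeeping only, not vLLM's implementation; the block size of 16 tokens is an assumption, and a real system would also handle eviction and copy-on-write sharing.

```python
class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention.

    The KV cache is carved into fixed-size physical blocks; each sequence
    keeps a block table mapping its logical positions to physical blocks,
    so storage need not be contiguous and is allocated only as tokens arrive.
    """
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                        # seq_id -> [block ids]
        self.lengths = {}                             # seq_id -> tokens stored

    def append_token(self, seq_id: int):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:             # current block full (or none yet)
            table.append(self.free_blocks.pop())      # grab exactly one more block
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: int):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):
    cache.append_token(seq_id=0)
print(cache.block_tables[0])   # 3 blocks of 16 tokens cover 40 tokens
```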
vLLM Performance
| Comparison | Throughput Improvement |
|---|---|
| vs FasterTransformer (same latency) | 2-4× |
| vs HuggingFace Transformers | 8.5-15× |
| Best-case configurations | Up to 24× |
KV-Cache Management Strategies
KV-cache size scales with sequence length and batch size, often becoming the memory bottleneck. Several architectural innovations address this challenge.
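A back-of-the-envelope calculation shows why, as sketched below. The figures use an illustrative LLaMA-2-70B-like configuration (80 layers, 128-dim heads, 8 KV heads under GQA versus 64 full heads) and FP16 storage; they also preview the GQA savings discussed next.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 70B-class shape: 80 layers, head_dim 128, batch 16, 4K context.
gqa = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=4096, batch=16)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096, batch=16)
print(f"GQA cache: {gqa / 2**30:.1f} GiB, full-MHA cache: {mha / 2**30:.1f} GiB")
# -> GQA cache: 20.0 GiB, full-MHA cache: 160.0 GiB
```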
Multi-Query Attention (MQA)
MQA uses a single shared K/V head across all query heads:
- 10-100× smaller KV cache
- Up to ~12× faster decoding in reported benchmarks
- Some quality degradation on complex tasks
Grouped-Query Attention (GQA)
GQA balances MQA efficiency with MHA quality by letting groups of query heads share K/V heads. It is used in LLaMA 2 (70B), LLaMA 3, Gemma 3, and Qwen3.
Results: 75% memory savings (4× reduction) for 32K context with minimal quality impact.
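A minimal sketch of the grouping follows, assuming PyTorch and illustrative head counts. Setting the number of KV heads to 1 recovers MQA, and matching it to the number of query heads recovers standard MHA.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: [batch, n_q_heads, seq, dim]; k, v: [batch, n_kv_heads, seq, dim].

    Each group of query heads shares one K/V head, so only n_kv_heads worth
    of K/V needs to be cached.
    """
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # expand shared heads for the matmul only
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 32, 128, 64)   # 32 query heads
k = torch.randn(1, 8, 128, 64)    # 8 KV heads -> 4× smaller KV cache
v = torch.randn(1, 8, 128, 64)
print(grouped_query_attention(q, k, v).shape)   # torch.Size([1, 32, 128, 64])
```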
Multi-Head Latent Attention (MLA)
DeepSeek-V2 introduced MLA, which compresses the cache through a low-rank factorized projection:
- Up to 93% reduction in cache size
- Nearly 6× generation throughput
- Maintains quality through learned compression
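The sketch below shows only the low-rank caching idea: cache a small latent vector per token and reconstruct K/V with up-projections. DeepSeek's actual MLA also decouples rotary position embeddings and absorbs the up-projections into the attention computation, which is omitted here; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    """Minimal sketch of the low-rank idea behind MLA (RoPE handling omitted)."""
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, head_dim=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # output is cached
        self.up_k = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # recomputed on the fly
        self.up_v = nn.Linear(d_latent, n_heads * head_dim, bias=False)

    def forward(self, h):
        c = self.down(h)                     # [batch, seq, d_latent] -- only this is cached
        return self.up_k(c), self.up_v(c), c

mla = LowRankKVCompression()
h = torch.randn(1, 16, 4096)
k, v, cache = mla(h)
print(f"cached fraction vs full K+V: {cache.shape[-1] / (k.shape[-1] + v.shape[-1]):.3f}")
```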
Speculative Decoding (EAGLE Series)
Speculative decoding accelerates generation by drafting multiple tokens with a small model, then verifying with the target model in parallel.
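A minimal greedy draft-and-verify loop is sketched below with toy stand-in "models" (a shared transition table plus noise). Real systems — and the EAGLE series in particular — draft at the feature level, use tree-structured drafts, and verify with a rejection-sampling rule that preserves the target distribution under sampling; none of that is shown here.

```python
import torch

vocab = 100
table = torch.randn(vocab, vocab, generator=torch.Generator().manual_seed(0))

def target_fn(tokens):
    # Toy target model: logits for the next token at each position.
    return table[tokens]

def draft_fn(tokens):
    # Toy draft model: a cheap, slightly noisy approximation of the target.
    return table[tokens] + 0.1 * torch.randn(len(tokens), vocab)

def speculative_step(target_logits_fn, draft_logits_fn, prefix, k=4):
    # 1) Draft k tokens autoregressively with the cheap model (greedy).
    draft = prefix.clone()
    for _ in range(k):
        next_tok = draft_logits_fn(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    # 2) Verify all k drafted tokens with a single target-model pass, keeping
    #    the longest prefix on which the target (greedily) agrees.
    target_pred = target_logits_fn(draft).argmax(dim=-1)
    accepted = prefix.clone()
    for i in range(len(prefix), len(draft)):
        if draft[i] == target_pred[i - 1]:        # target agrees with the drafted token
            accepted = torch.cat([accepted, draft[i].view(1)])
        else:                                     # first mismatch: take the target's token, stop
            accepted = torch.cat([accepted, target_pred[i - 1].view(1)])
            break
    return accepted

prefix = torch.tensor([5, 17, 42])
print(speculative_step(target_fn, draft_fn, prefix))
```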
EAGLE (ICML 2024)
EAGLE operates at the feature level rather than token level, enabling more efficient drafting. Results on LLaMA2-Chat 70B:
- 2.7-3.5× latency speedup
- Doubled throughput
EAGLE-2 (EMNLP 2024)
EAGLE-2 introduced a dynamic draft-tree structure, adapting speculation depth and breadth using the draft model's confidence scores (a proxy for acceptance rates). This makes it roughly 20-40% faster than EAGLE-1.
EAGLE-3 (NeurIPS 2025)
EAGLE-3 represents the current state of the art, combining training-time test with multi-layer feature fusion:
| Metric | EAGLE-3 Performance |
|---|---|
| vs Vanilla Autoregressive | 3.0-6.5× speedup |
| vs EAGLE-2 | 20-40% improvement |
| LLaMA-3.3-70B | 4.0-4.8× speedup |
Continuous Batching
Yu et al. introduced iteration-level scheduling in Orca (OSDI 2022), enabling dynamic batch composition at each decoding iteration.
Traditional vs Continuous Batching
Traditional batching processes sequences together until all complete, wasting compute on padding. Continuous batching allows new sequences to enter and completed sequences to exit at each iteration.
Results: up to 23× throughput improvement over static batching, with correspondingly higher GPU utilization.
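The toy scheduler below sketches the iteration-level loop described above. The request lengths and batch size are made-up numbers, and a real scheduler would also account for KV-cache memory when admitting requests.

```python
import random
from collections import deque

def continuous_batching(lengths, max_batch=8):
    """Toy iteration-level scheduler in the spirit of Orca / vLLM.

    `lengths` holds how many tokens each request still needs. At every decode
    iteration, finished sequences leave the batch and queued requests join
    immediately, instead of waiting for the whole batch to drain.
    """
    queue = deque(enumerate(lengths))       # (request id, tokens remaining)
    running = {}                            # request id -> tokens remaining
    iterations = 0
    while queue or running:
        # Admit new requests into any free batch slots.
        while queue and len(running) < max_batch:
            rid, length = queue.popleft()
            running[rid] = length
        # One decoding step for every running sequence ("one iteration").
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:           # finished: slot frees this iteration
                del running[rid]
        iterations += 1
    return iterations

random.seed(0)
lengths = [random.randint(10, 200) for _ in range(32)]
print("continuous batching iterations:", continuous_batching(lengths))

# Static batching must wait for the longest sequence in each batch of 8:
static = sum(max(lengths[i:i + 8]) for i in range(0, len(lengths), 8))
print("static batching iterations:", static)
```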
Inference Engine Recommendations
| Hardware | Recommended Engine | Notes |
|---|---|---|
| H100/B200 | TensorRT-LLM with FP8 | Maximum throughput |
| Consumer GPUs | vLLM + AWQ | Best quality/speed tradeoff |
| CPU/Apple Silicon | llama.cpp with GGUF | Optimized for non-GPU |
TensorRT-LLM Performance
- Reported throughput exceeding 10,000 tokens/s on H100 (model- and batch-dependent)
- Reported ~70% faster than llama.cpp on an RTX 4090
All benchmarks are highly hardware-dependent. Results vary significantly based on GPU architecture, memory bandwidth, batch size, and model architecture. Always benchmark on target hardware before production deployment.
Implementation Recommendations
- Start with vLLM + AWQ for most production deployments (a minimal sketch follows this list)
- Use GGUF for CPU/edge deployment scenarios
- Enable FlashAttention by default (usually automatic in modern frameworks)
- Consider speculative decoding for latency-sensitive applications
- Implement continuous batching for throughput-oriented workloads
- Profile memory usage to optimize batch sizes for available hardware
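As a starting point for the first recommendation, here is a minimal vLLM offline-inference sketch with an AWQ checkpoint. The model id is a placeholder for whichever AWQ-quantized checkpoint you actually deploy, and all other engine settings are left at their defaults.

```python
from vllm import LLM, SamplingParams

# Placeholder model id: substitute the AWQ-quantized checkpoint you intend to serve.
llm = LLM(model="TheBloke/Llama-2-13B-chat-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize why PagedAttention reduces KV-cache waste."], params)
print(outputs[0].outputs[0].text)
```

For online serving, the same engine is typically exposed through vLLM's OpenAI-compatible server rather than called in-process as above.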
References
- Dao et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022
- Kwon et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023, arXiv:2309.06180
- Li et al. (2025). "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test." NeurIPS 2025, arXiv:2503.01840
- Lin et al. (2023). "AWQ: Activation-aware Weight Quantization." MLSys 2024
- Frantar et al. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023
- Yu et al. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022