LLM Inference Optimization Techniques

From quantization to speculative decoding: a comprehensive technical guide to accelerating large language model inference.

The gap between training and deploying large language models has created intense focus on inference optimization. While model capabilities scale with parameters, so do computational requirements — making efficient inference essential for production viability. This research examines the full stack of optimization techniques, from weight quantization to speculative decoding.

Quantization Methods

Quantization reduces model size and accelerates inference by representing weights with fewer bits. Three methods dominate production deployments, each with distinct characteristics.

GPTQ

GPTQ implements post-training quantization using Hessian-based optimization. The algorithm sequentially quantizes weights while compensating for quantization error using second-order information.

Key characteristics:

  • Excels at GPU inference; best suited to dedicated-GPU deployments
  • ~5× faster than GGUF when using Marlin kernels
  • Requires a calibration dataset for optimal results
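
The transformers library integrates GPTQ via optimum, so quantization can be driven from a few lines of Python. Below is a minimal sketch, assuming transformers, optimum, and a GPTQ backend such as auto-gptq are installed; the model ID and calibration dataset are illustrative placeholders.

```python
# Sketch: 4-bit GPTQ post-training quantization via the transformers/optimum integration.
# The model ID and calibration dataset ("c4") are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs calibration data to estimate the Hessian used for error compensation.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs layer by layer while loading; expect hours for large models.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

model.save_pretrained("llama-2-7b-gptq-4bit")
tokenizer.save_pretrained("llama-2-7b-gptq-4bit")
```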

AWQ (Activation-aware Weight Quantization)

Developed by MIT-HAN lab, AWQ protects salient weights by observing activation distributions. Rather than treating all weights equally, AWQ identifies and preserves weights that significantly impact activations.

Performance characteristics:

  • ~95% quality retention (vs GPTQ's ~90%)
  • Particularly effective for instruction-tuned models
  • Integrated with TensorRT-LLM, vLLM, HuggingFace TGI
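
Serving a pre-quantized AWQ checkpoint is typically a one-flag change in engines that support the format. Here is a minimal sketch using vLLM's offline API; the checkpoint name is an illustrative placeholder.

```python
# Sketch: running an AWQ-quantized checkpoint with vLLM's offline inference API.
# The model ID below stands in for any 4-bit AWQ checkpoint on the Hub.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",                    # select the AWQ kernels
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], params)
print(outputs[0].outputs[0].text)
```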

GGUF

Native to llama.cpp, GGUF is the go-to format for CPU and Apple Silicon inference. The format supports mixed quantization levels across layers.

Characteristics:

  • ~92% quality retention
  • Quantizes in minutes (vs hours for GPTQ/AWQ)
  • Excellent for CPU and Apple Silicon deployment
  • Supports dynamic quantization strategies
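
For local or edge deployment, the llama-cpp-python bindings expose llama.cpp's GGUF loader directly. A minimal sketch; the model path and Q4_K_M quantization level are illustrative.

```python
# Sketch: running a GGUF model with llama-cpp-python (bindings for llama.cpp).
# The model path and quantization level (Q4_K_M) are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to Metal/GPU where available; 0 = pure CPU
)

out = llm("Q: What is quantization?\nA:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```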

Throughput Impact

At 4-bit, AWQ and GPTQ models process roughly 3× more requests per second than unquantized BF16 models at comparable quality levels.

FlashAttention Evolution

The FlashAttention series has revolutionized attention computation through IO-aware algorithm design. Understanding this evolution helps match the right kernel generation to the target GPU architecture.

FlashAttention (Dao et al., NeurIPS 2022)

The original FlashAttention introduced IO-aware attention computation, accounting for reads/writes between GPU HBM (high bandwidth memory) and on-chip SRAM.

Results:

  • 2-4× speedup, up to 7.6× on GPT-2
  • 10× memory savings at 2K sequence length
  • 20× memory savings at 4K sequence length

FlashAttention-2 (2023)

FlashAttention-2 achieved 2× speedup over FA-1 through improved parallelism and reduced non-matmul FLOPs. Performance reaches 50-73% of theoretical maximum FLOPs/s on A100, up to 225 TFLOPs/s.

FlashAttention-3 (2024)

FlashAttention-3 introduced Hopper architecture optimizations with FP8 Tensor Core support, nearly doubling TFLOPs/s compared to FA-2 on H100 GPUs.

FlashAttention-4 (2025)

FlashAttention-4 targets the Blackwell architecture, achieving approximately 20% speedup over cuDNN kernels on B200 GPUs.
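
In practice these kernels are rarely invoked by hand; most serving stacks expose a switch. A minimal sketch of enabling FlashAttention-2 when loading a model with transformers, assuming the flash-attn package and an Ampere-or-newer GPU; the model ID is a placeholder.

```python
# Sketch: enabling FlashAttention-2 kernels when loading a model with transformers.
# Assumes the `flash-attn` package is installed and the GPU supports it; on other
# platforms, "sdpa" is a reasonable substitute. Model ID is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",   # placeholder model
    torch_dtype=torch.bfloat16,              # FlashAttention requires fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```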

PagedAttention and vLLM

Kwon et al. introduced PagedAttention at SOSP 2023 (arXiv:2309.06180), applying OS-inspired virtual memory paging to KV cache management.

The Problem

Existing inference systems waste 60-80% of KV cache memory due to fragmentation and over-allocation. This severely limits batch sizes and throughput.

The Solution

PagedAttention divides the KV cache into fixed-size blocks that can be non-contiguously allocated, similar to virtual memory pages. This achieves <4% memory waste.
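
To make the idea concrete, here is a toy sketch of a block table mapping a sequence's logical KV positions to non-contiguous physical blocks. It illustrates the paging concept only; it is not vLLM's actual implementation.

```python
# Toy sketch of PagedAttention-style block management (not vLLM's actual code).
# KV-cache memory is handed out in fixed-size pages on demand; each sequence keeps a
# block table mapping logical block indices to arbitrary physical blocks.
BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()       # any free physical block will do

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []    # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the current one fills up,
        # so at most one partially used block exists per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):                         # generate 40 tokens
    seq.append_token()
print(seq.block_table)                      # three physical blocks, not necessarily contiguous
```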

vLLM Performance

  • 2-4× higher throughput than FasterTransformer at the same latency
  • 8.5-15× higher throughput than HuggingFace Transformers
  • Up to 24× higher throughput in optimal configurations

KV-Cache Management Strategies

KV-cache size scales with sequence length and batch size, often becoming the memory bottleneck. Several architectural innovations address this challenge.

Multi-Query Attention (MQA)

MQA uses a single shared K/V head across all query heads:

  • 10-100× smaller KV cache
  • 12× faster inference
  • Some quality degradation on complex tasks

Grouped-Query Attention (GQA)

GQA balances MQA's efficiency with the quality of full multi-head attention (MHA) by having groups of query heads share K/V heads. It is used in LLaMA 2/3, Gemma 3, and Qwen3.

Results: 75% memory savings (4× reduction) for 32K context with minimal quality impact.
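
A minimal PyTorch sketch of the mechanism: K/V are projected to fewer heads and each K/V head is shared by a group of query heads (setting the number of K/V heads to 1 recovers MQA). The dimensions are illustrative, not tied to any specific model.

```python
# Sketch: grouped-query attention with illustrative dimensions.
# num_kv_heads < num_heads shrinks the KV cache; num_kv_heads == 1 recovers MQA.
import torch
import torch.nn.functional as F

batch, seq_len = 2, 128
num_heads, num_kv_heads, head_dim = 32, 8, 128   # 4x smaller KV cache than full MHA
group_size = num_heads // num_kv_heads

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)   # only these are cached
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Repeat each K/V head across its query group, then run standard attention.
k_exp = k.repeat_interleave(group_size, dim=1)   # (batch, num_heads, seq_len, head_dim)
v_exp = v.repeat_interleave(group_size, dim=1)
out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
print(out.shape)  # torch.Size([2, 32, 128, 128])
```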

Multi-Head Latent Attention (MLA)

DeepSeek-V2 introduced MLA, using low-rank factorized projection for cache compression:

  • Up to 93% reduction in cache size
  • Nearly 6× generation throughput
  • Maintains quality through learned compression
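
A simplified sketch of the core idea: cache a small low-rank latent instead of full K/V and reconstruct K/V with up-projections at attention time. DeepSeek's actual design adds decoupled rotary-embedding handling and other details omitted here; all dimensions are illustrative.

```python
# Simplified sketch of latent KV compression in the spirit of MLA (not DeepSeek's exact
# design; decoupled RoPE handling is omitted). Only the small latent `c_kv` is cached.
import torch
import torch.nn as nn

d_model, d_latent, num_heads, head_dim = 4096, 512, 32, 128

down_kv = nn.Linear(d_model, d_latent, bias=False)            # compress to latent
up_k = nn.Linear(d_latent, num_heads * head_dim, bias=False)  # reconstruct K
up_v = nn.Linear(d_latent, num_heads * head_dim, bias=False)  # reconstruct V

x = torch.randn(2, 128, d_model)   # (batch, seq_len, d_model) hidden states
c_kv = down_kv(x)                  # (2, 128, 512); this is all that gets cached

# A full cache entry would hold 2 * num_heads * head_dim = 8192 values per token;
# the latent holds 512, roughly a 94% reduction in this illustrative configuration.
k = up_k(c_kv).view(2, 128, num_heads, head_dim).transpose(1, 2)
v = up_v(c_kv).view(2, 128, num_heads, head_dim).transpose(1, 2)
print(c_kv.shape, k.shape, v.shape)
```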

Speculative Decoding (EAGLE Series)

Speculative decoding accelerates generation by drafting multiple tokens with a small model, then verifying with the target model in parallel.
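
A toy sketch of the basic draft-then-verify control flow with greedy acceptance. The draft and target "models" are cheap stand-ins; real systems (and the EAGLE series below) draft at the feature level, use tree-structured candidates, and verify with rejection sampling rather than exact greedy matching.

```python
# Toy sketch of draft-then-verify speculative decoding with greedy acceptance.
# The "models" are deterministic stand-ins; only the control flow is realistic.
import torch

VOCAB, K = 100, 4  # toy vocabulary size and draft length

def draft_next_token(tokens: list[int]) -> int:
    """Stand-in for a small draft model: cheap, sometimes wrong."""
    return (tokens[-1] * 7 + 3) % VOCAB

def target_logits(tokens: list[int]) -> torch.Tensor:
    """Stand-in for the target model: one row of next-token logits per position."""
    rows = []
    for i in range(len(tokens)):
        gen = torch.Generator().manual_seed(sum(tokens[: i + 1]))  # causal: prefix only
        rows.append(torch.randn(VOCAB, generator=gen))
    return torch.stack(rows)

def speculative_step(tokens: list[int]) -> list[int]:
    # 1) Draft K tokens autoregressively with the cheap model.
    draft = list(tokens)
    for _ in range(K):
        draft.append(draft_next_token(draft))

    # 2) Score the entire drafted continuation with the target model in one parallel pass.
    logits = target_logits(draft)

    # 3) Accept drafted tokens while they match the target's greedy choice.
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        target_choice = int(logits[i - 1].argmax())
        if draft[i] == target_choice:
            accepted.append(draft[i])       # draft agreed with the target: keep it
        else:
            accepted.append(target_choice)  # first mismatch: take the target's token, stop
            break
    else:
        # All K drafts accepted; the target's prediction after the last draft is free.
        accepted.append(int(logits[-1].argmax()))
    return accepted

tokens = [1]
for _ in range(10):
    tokens = speculative_step(tokens)
print(tokens)
```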

EAGLE (ICML 2024)

EAGLE operates at the feature level rather than token level, enabling more efficient drafting. Results on LLaMA2-Chat 70B:

  • 2.7-3.5× latency speedup
  • Doubled throughput

EAGLE-2 (EMNLP 2024)

EAGLE-2 introduced a dynamic draft-tree structure that adapts speculation depth based on acceptance rates, making it roughly 1.8× faster than EAGLE-1.

EAGLE-3 (NeurIPS 2025)

EAGLE-3 represents the current state of the art, combining a training-time test procedure with multi-layer feature fusion:

  • 3.0-6.5× speedup over vanilla autoregressive decoding
  • 20-40% improvement over EAGLE-2
  • 4.0-4.8× speedup on LLaMA-3.3-70B

Continuous Batching

Yu et al. introduced iteration-level scheduling in Orca (OSDI 2022), enabling dynamic batch composition at each decoding iteration.

Traditional vs Continuous Batching

Traditional batching processes sequences together until all complete, wasting compute on padding. Continuous batching allows new sequences to enter and completed sequences to exit at each iteration.

Results: up to 23× higher throughput than static batching, driven by improved GPU utilization.
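
A toy sketch of iteration-level scheduling: at every decode step, finished sequences leave the batch and queued requests join, instead of the whole batch waiting for its slowest member. Real engines combine this with paged KV-cache allocation and prefill/decode scheduling policies.

```python
# Toy sketch of iteration-level (continuous) batching; requests are stand-in dicts that
# only track how many tokens they still need to generate.
import random
from collections import deque

MAX_BATCH = 4

waiting = deque({"id": i, "remaining": random.randint(3, 10)} for i in range(10))
running: list[dict] = []
step = 0

while waiting or running:
    # Admit new requests whenever batch slots are free (the key difference from
    # static batching, which only admits work between full batches).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One decode iteration: every running sequence produces one token.
    for seq in running:
        seq["remaining"] -= 1

    # Completed sequences exit immediately, freeing slots for the next iteration.
    finished = [s for s in running if s["remaining"] == 0]
    running = [s for s in running if s["remaining"] > 0]

    step += 1
    if finished:
        print(f"step {step}: finished {[s['id'] for s in finished]}, "
              f"running={len(running)}, waiting={len(waiting)}")
```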

Inference Engine Recommendations

  • H100/B200: TensorRT-LLM with FP8 (maximum throughput)
  • Consumer GPUs: vLLM + AWQ (best quality/speed tradeoff)
  • CPU/Apple Silicon: llama.cpp with GGUF (optimized for non-GPU deployment)

TensorRT-LLM Performance

  • 10,000+ tokens/s reported on H100
  • ~70% faster than llama.cpp on RTX 4090

Hardware Dependency

All benchmarks are highly hardware-dependent. Results vary significantly based on GPU architecture, memory bandwidth, batch size, and model architecture. Always benchmark on target hardware before production deployment.

Implementation Recommendations

  1. Start with vLLM + AWQ for most production deployments
  2. Use GGUF for CPU/edge deployment scenarios
  3. Enable FlashAttention by default (usually automatic in modern frameworks)
  4. Consider speculative decoding for latency-sensitive applications
  5. Implement continuous batching for throughput-oriented workloads
  6. Profile memory usage to optimize batch sizes for available hardware

References

  • Dao et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.
  • Kwon et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180.
  • Li et al. (2025). "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test." NeurIPS 2025. arXiv:2503.01840.
  • Lin et al. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
  • Frantar et al. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
  • Yu et al. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022.