The gap between training and deploying large language models has created intense focus on inference optimization. While model capabilities scale with parameters, so do computational requirements — making efficient inference essential for production viability. This research examines the full stack of optimization techniques, from weight quantization to speculative decoding.
Quantization Methods
Quantization reduces model size and accelerates inference by representing weights with fewer bits. Three methods dominate production deployments, each with distinct characteristics.
GPTQ
GPTQ implements post-training quantization using Hessian-based optimization. The algorithm sequentially quantizes weights while compensating for quantization error using second-order information.
Key characteristics:
- Excels at GPU inference; best suited to dedicated-GPU deployments
- ~5× faster than GGUF when using Marlin kernels
- Requires a calibration dataset for optimal results
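To make the mechanics concrete, the sketch below shows plain round-to-nearest 4-bit quantization with per-group scales — only the storage format that GPTQ targets, not GPTQ itself, which additionally uses Hessian-based error compensation while rounding. The shapes and the group size of 128 are illustrative assumptions.

```python
import torch

def quantize_rtn(weight: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Round-to-nearest quantization with per-group symmetric scales.

    Illustrates the 4-bit storage format only; GPTQ adds second-order
    (Hessian) error compensation on top of this. Assumes in_features is
    divisible by group_size.
    """
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)

    qmax = 2 ** (bits - 1) - 1                                   # 7 for signed 4-bit
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)     # integer codes

    dequant = (q * scale).reshape(out_features, in_features)     # what inference sees
    return q.to(torch.int8), scale, dequant

w = torch.randn(4096, 4096)
q, scale, w_hat = quantize_rtn(w)
print("mean abs rounding error:", (w - w_hat).abs().mean().item())
```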
AWQ (Activation-aware Weight Quantization)
Developed by the MIT HAN Lab, AWQ protects salient weights by observing activation distributions. Rather than treating all weights equally, it identifies and preserves the weights that most strongly influence activations.
Performance characteristics:
- ~95% quality retention (vs GPTQ's ~90%)
- Particularly effective for instruction-tuned models
- Integrated with TensorRT-LLM, vLLM, HuggingFace TGI
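The sketch below illustrates the core AWQ intuition — per-input-channel scaling driven by activation magnitudes — in heavily simplified form. The real method searches the scaling exponent per layer and fuses the scales into adjacent operators; the `alpha = 0.5` exponent and all shapes here are assumptions for illustration.

```python
import torch

def awq_style_scale(weight, act_samples, alpha=0.5):
    """Per-input-channel scaling in the spirit of AWQ (simplified).

    Channels with large average activation magnitude get their weights
    scaled up before quantization (and activations scaled down by the same
    factor), shrinking rounding error on the "salient" channels.
    """
    act_scale = act_samples.abs().mean(dim=0)        # [in_features], from calibration data
    s = act_scale.clamp(min=1e-5) ** alpha
    s = s / s.mean()                                  # keep overall magnitude stable

    scaled_weight = weight * s                        # quantize this instead of `weight`
    return scaled_weight, s                           # at runtime: x -> x / s

w = torch.randn(4096, 1024)      # [out_features, in_features]
x = torch.randn(256, 1024)       # calibration activations
w_scaled, s = awq_style_scale(w, x)

# The transformation is mathematically a no-op before quantization:
print(torch.allclose((x / s) @ w_scaled.T, x @ w.T, atol=1e-3))
```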
GGUF
Native to llama.cpp, GGUF is the format of choice for CPU and Apple Silicon inference. It supports mixed quantization levels across layers.
Characteristics:
- ~92% quality retention
- Quantizes in minutes (vs hours for GPTQ/AWQ)
- Excellent for CPU and Apple Silicon deployment
- Supports dynamic quantization strategies
At 4-bit, AWQ and GPTQ models process roughly 3× more requests per second than unquantized BF16 baselines at comparable quality levels.
FlashAttention Evolution
The FlashAttention series has revolutionized attention computation through IO-aware algorithm design. Understanding this evolution is essential for optimal deployment.
FlashAttention (Dao et al., NeurIPS 2022)
The original FlashAttention introduced IO-aware attention computation, accounting for reads/writes between GPU HBM (high bandwidth memory) and on-chip SRAM.
Results:
- 2-4× speedup, up to 7.6× on GPT-2
- 10× memory savings at 2K sequence length
- 20× memory savings at 4K sequence length
FlashAttention-2 (2023)
FlashAttention-2 achieved 2× speedup over FA-1 through improved parallelism and reduced non-matmul FLOPs. Performance reaches 50-73% of theoretical maximum FLOPs/s on A100, up to 225 TFLOPs/s.
FlashAttention-3 (2024)
FlashAttention-3 introduced Hopper architecture optimizations with FP8 Tensor Core support, nearly doubling TFLOPs/s compared to FA-2 on H100 GPUs.
FlashAttention-4 (2025)
FlashAttention-4 targets the Blackwell architecture, achieving approximately 20% speedup over cuDNN kernels on B200 GPUs.
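Regardless of generation, most deployments enable FlashAttention through framework dispatch rather than by calling kernels directly. The snippet below is a minimal sketch of the explicit opt-in path via PyTorch's scaled_dot_product_attention, assuming PyTorch 2.3+ and a CUDA GPU with a FlashAttention backend available.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# [batch, heads, seq_len, head_dim] in half precision on a CUDA device.
q = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)

# Restrict dispatch to the FlashAttention backend so a slower fallback
# is not silently used; without the context manager PyTorch picks
# an appropriate fused kernel automatically.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([1, 32, 4096, 128])
```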
PagedAttention and vLLM
Kwon et al. introduced PagedAttention at SOSP 2023 (arXiv:2309.06180), applying OS-inspired virtual memory paging to KV cache management.
The Problem
Existing inference systems waste 60-80% of KV cache memory due to fragmentation and over-allocation. This severely limits batch sizes and throughput.
The Solution
PagedAttention divides the KV cache into fixed-size blocks that can be non-contiguously allocated, similar to virtual memory pages. This achieves <4% memory waste.
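A toy block allocator makes the idea concrete. This is a sketch of the bookkeeping only, not vLLM's implementation; the block size of 16 tokens is an assumption, and a real system would also handle eviction and copy-on-write sharing.

```python
class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention.

    The KV cache is carved into fixed-size physical blocks; each sequence
    keeps a block table mapping its logical positions to physical blocks,
    so storage need not be contiguous and is allocated only as tokens arrive.
    """
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                        # seq_id -> [block ids]
        self.lengths = {}                             # seq_id -> tokens stored

    def append_token(self, seq_id: int):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:             # current block full (or none yet)
            table.append(self.free_blocks.pop())      # grab exactly one more block
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: int):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):
    cache.append_token(seq_id=0)
print(cache.block_tables[0])   # 3 blocks of 16 tokens cover 40 tokens
```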
vLLM Performance
| Comparison | Throughput Improvement |
|---|---|
| vs FasterTransformer (same latency) | 2-4× |
| vs HuggingFace Transformers | 8.5-15× |
| Best-case configurations | Up to 24× |
KV-Cache Management Strategies
KV-cache size scales with sequence length and batch size, often becoming the memory bottleneck. Several architectural innovations address this challenge.
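A back-of-the-envelope calculation shows why, as sketched below. The figures use an illustrative LLaMA-2-70B-like configuration (80 layers, 128-dim heads, 8 KV heads under GQA versus 64 full heads) and FP16 storage; they also preview the GQA savings discussed next.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 70B-class shape: 80 layers, head_dim 128, batch 16, 4K context.
gqa = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=4096, batch=16)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096, batch=16)
print(f"GQA cache: {gqa / 2**30:.1f} GiB, full-MHA cache: {mha / 2**30:.1f} GiB")
# -> GQA cache: 20.0 GiB, full-MHA cache: 160.0 GiB
```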
Multi-Query Attention (MQA)
MQA uses a single shared K/V head across all query heads:
- 10-100× smaller KV cache
- Up to ~12× faster decoding in reported benchmarks
- Some quality degradation on complex tasks
Grouped-Query Attention (GQA)
GQA balances MQA efficiency with MHA quality by letting groups of query heads share K/V heads. It is used in LLaMA 2 (70B), LLaMA 3, Gemma 3, and Qwen3.
Results: 75% memory savings (4× reduction) for 32K context with minimal quality impact.
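A minimal sketch of the grouping follows, assuming PyTorch and illustrative head counts. Setting the number of KV heads to 1 recovers MQA, and matching it to the number of query heads recovers standard MHA.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: [batch, n_q_heads, seq, dim]; k, v: [batch, n_kv_heads, seq, dim].

    Each group of query heads shares one K/V head, so only n_kv_heads worth
    of K/V needs to be cached.
    """
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # expand shared heads for the matmul only
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 32, 128, 64)   # 32 query heads
k = torch.randn(1, 8, 128, 64)    # 8 KV heads -> 4× smaller KV cache
v = torch.randn(1, 8, 128, 64)
print(grouped_query_attention(q, k, v).shape)   # torch.Size([1, 32, 128, 64])
```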
Multi-Head Latent Attention (MLA)
DeepSeek-V2 introduced MLA, which compresses the cache through a low-rank factorized projection:
- Up to 93% reduction in cache size
- Nearly 6× generation throughput
- Maintains quality through learned compression
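The sketch below shows only the low-rank caching idea: cache a small latent vector per token and reconstruct K/V with up-projections. DeepSeek's actual MLA also decouples rotary position embeddings and absorbs the up-projections into the attention computation, which is omitted here; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    """Minimal sketch of the low-rank idea behind MLA (RoPE handling omitted)."""
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, head_dim=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # output is cached
        self.up_k = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # recomputed on the fly
        self.up_v = nn.Linear(d_latent, n_heads * head_dim, bias=False)

    def forward(self, h):
        c = self.down(h)                     # [batch, seq, d_latent] -- only this is cached
        return self.up_k(c), self.up_v(c), c

mla = LowRankKVCompression()
h = torch.randn(1, 16, 4096)
k, v, cache = mla(h)
print(f"cached fraction vs full K+V: {cache.shape[-1] / (k.shape[-1] + v.shape[-1]):.3f}")
```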
Speculative Decoding (EAGLE Series)
Speculative decoding accelerates generation by drafting multiple tokens with a small model, then verifying with the target model in parallel.
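A minimal greedy draft-and-verify loop is sketched below with toy stand-in "models" (a shared transition table plus noise). Real systems — and the EAGLE series in particular — draft at the feature level, use tree-structured drafts, and verify with a rejection-sampling rule that preserves the target distribution under sampling; none of that is shown here.

```python
import torch

vocab = 100
table = torch.randn(vocab, vocab, generator=torch.Generator().manual_seed(0))

def target_fn(tokens):
    # Toy target model: logits for the next token at each position.
    return table[tokens]

def draft_fn(tokens):
    # Toy draft model: a cheap, slightly noisy approximation of the target.
    return table[tokens] + 0.1 * torch.randn(len(tokens), vocab)

def speculative_step(target_logits_fn, draft_logits_fn, prefix, k=4):
    # 1) Draft k tokens autoregressively with the cheap model (greedy).
    draft = prefix.clone()
    for _ in range(k):
        next_tok = draft_logits_fn(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    # 2) Verify all k drafted tokens with a single target-model pass, keeping
    #    the longest prefix on which the target (greedily) agrees.
    target_pred = target_logits_fn(draft).argmax(dim=-1)
    accepted = prefix.clone()
    for i in range(len(prefix), len(draft)):
        if draft[i] == target_pred[i - 1]:        # target agrees with the drafted token
            accepted = torch.cat([accepted, draft[i].view(1)])
        else:                                     # first mismatch: take the target's token, stop
            accepted = torch.cat([accepted, target_pred[i - 1].view(1)])
            break
    return accepted

prefix = torch.tensor([5, 17, 42])
print(speculative_step(target_fn, draft_fn, prefix))
```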
EAGLE (ICML 2024)
EAGLE operates at the feature level rather than token level, enabling more efficient drafting. Results on LLaMA2-Chat 70B:
- 2.7-3.5× latency speedup
- Doubled throughput
EAGLE-2 (EMNLP 2024)
EAGLE-2 introduced a dynamic draft-tree structure, adapting speculation depth and breadth using the draft model's confidence scores (a proxy for acceptance rates). This makes it roughly 20-40% faster than EAGLE-1.
EAGLE-3 (NeurIPS 2025)
EAGLE-3 represents the current state of the art, combining training-time test with multi-layer feature fusion:
| Metric | EAGLE-3 Performance |
|---|---|
| vs Vanilla Autoregressive | 3.0-6.5× speedup |
| vs EAGLE-2 | 20-40% improvement |
| LLaMA-3.3-70B | 4.0-4.8× speedup |
Continuous Batching
Yu et al. introduced iteration-level scheduling in Orca (OSDI 2022), enabling dynamic batch composition at each decoding iteration.
Traditional vs Continuous Batching
Traditional batching processes sequences together until all complete, wasting compute on padding. Continuous batching allows new sequences to enter and completed sequences to exit at each iteration.
Results: up to 23× throughput improvement over static batching, with correspondingly higher GPU utilization.
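The toy scheduler below sketches the iteration-level loop described above. The request lengths and batch size are made-up numbers, and a real scheduler would also account for KV-cache memory when admitting requests.

```python
import random
from collections import deque

def continuous_batching(lengths, max_batch=8):
    """Toy iteration-level scheduler in the spirit of Orca / vLLM.

    `lengths` holds how many tokens each request still needs. At every decode
    iteration, finished sequences leave the batch and queued requests join
    immediately, instead of waiting for the whole batch to drain.
    """
    queue = deque(enumerate(lengths))       # (request id, tokens remaining)
    running = {}                            # request id -> tokens remaining
    iterations = 0
    while queue or running:
        # Admit new requests into any free batch slots.
        while queue and len(running) < max_batch:
            rid, length = queue.popleft()
            running[rid] = length
        # One decoding step for every running sequence ("one iteration").
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:           # finished: slot frees this iteration
                del running[rid]
        iterations += 1
    return iterations

random.seed(0)
lengths = [random.randint(10, 200) for _ in range(32)]
print("continuous batching iterations:", continuous_batching(lengths))

# Static batching must wait for the longest sequence in each batch of 8:
static = sum(max(lengths[i:i + 8]) for i in range(0, len(lengths), 8))
print("static batching iterations:", static)
```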
Inference Engine Recommendations
| Hardware | Recommended Engine | Notes |
|---|---|---|
| H100/B200 | TensorRT-LLM with FP8 | Maximum throughput |
| Consumer GPUs | vLLM + AWQ | Best quality/speed tradeoff |
| CPU/Apple Silicon | llama.cpp with GGUF | Optimized for non-GPU |
TensorRT-LLM Performance
- Reported throughput exceeding 10,000 tokens/s on H100 (model- and batch-dependent)
- Reported ~70% faster than llama.cpp on an RTX 4090
All benchmarks are highly hardware-dependent. Results vary significantly based on GPU architecture, memory bandwidth, batch size, and model architecture. Always benchmark on target hardware before production deployment.
Implementation Recommendations
- Start with vLLM + AWQ for most production deployments (a minimal sketch follows this list)
- Use GGUF for CPU/edge deployment scenarios
- Enable FlashAttention by default (usually automatic in modern frameworks)
- Consider speculative decoding for latency-sensitive applications
- Implement continuous batching for throughput-oriented workloads
- Profile memory usage to optimize batch sizes for available hardware
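As a starting point for the first recommendation, here is a minimal vLLM offline-inference sketch with an AWQ checkpoint. The model id is a placeholder for whichever AWQ-quantized checkpoint you actually deploy, and all other engine settings are left at their defaults.

```python
from vllm import LLM, SamplingParams

# Placeholder model id: substitute the AWQ-quantized checkpoint you intend to serve.
llm = LLM(model="TheBloke/Llama-2-13B-chat-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize why PagedAttention reduces KV-cache waste."], params)
print(outputs[0].outputs[0].text)
```

For online serving, the same engine is typically exposed through vLLM's OpenAI-compatible server rather than called in-process as above.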
References
- Dao et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022
- Kwon et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023, arXiv:2309.06180
- Li et al. (2025). "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test." NeurIPS 2025, arXiv:2503.01840
- Lin et al. (2023). "AWQ: Activation-aware Weight Quantization." MLSys 2024
- Frantar et al. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023
- Yu et al. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022