The Friendli Guide to Inference Performance Optimization is a practical playbook for teams deploying large language models in production. It provides a systematic, end-to-end methodology for selecting GPU infrastructure, benchmarking under realistic traffic conditions, and tuning inference configurations to meet strict latency SLAs while maximizing throughput and cost efficiency.

Rather than relying on synthetic benchmarks or guesswork, this guide walks practitioners through a proven four-step process—size, benchmark, tune, and select—to identify the optimal operating point for their workload. Along the way, it demystifies critical performance metrics like TTFT (time to first token), TPOT (time per output token), and System TPS (system-wide tokens per second), and shows how to balance them as a unified system. Whether you’re scaling a real-time application or optimizing for cost at high volume, this guide equips you with the tools and frameworks to make confident, data-driven decisions about inference performance.
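
As a rough illustration of how these metrics relate, the sketch below computes per-request TTFT and TPOT and a system-wide throughput figure from hypothetical benchmark timing records. The record fields and example numbers are assumptions made for illustration, not measurements or APIs from the guide itself.

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    """Hypothetical per-request timing record from a benchmark run (times in seconds)."""
    start: float          # wall-clock time the request was sent
    first_token: float    # wall-clock time the first output token arrived
    end: float            # wall-clock time the last output token arrived
    output_tokens: int    # number of tokens generated for this request

def ttft(r: RequestTiming) -> float:
    """Time to First Token: how long the user waits before any output appears."""
    return r.first_token - r.start

def tpot(r: RequestTiming) -> float:
    """Time Per Output Token: average gap between tokens after the first one."""
    return (r.end - r.first_token) / max(r.output_tokens - 1, 1)

def system_tps(requests: list[RequestTiming]) -> float:
    """System TPS: total output tokens per second across all concurrent requests."""
    total_tokens = sum(r.output_tokens for r in requests)
    window = max(r.end for r in requests) - min(r.start for r in requests)
    return total_tokens / window

# Illustrative load-test sample: two overlapping requests
runs = [
    RequestTiming(start=0.0, first_token=0.35, end=4.1, output_tokens=128),
    RequestTiming(start=0.5, first_token=0.92, end=5.0, output_tokens=150),
]
print(f"TTFT (req 0): {ttft(runs[0]):.2f} s")
print(f"TPOT (req 0): {tpot(runs[0]) * 1000:.1f} ms/token")
print(f"System TPS:   {system_tps(runs):.1f} tokens/s")
```

The tension the guide explores is visible even in this toy example: serving more requests concurrently raises System TPS but tends to push TTFT and TPOT upward, which is why latency targets and throughput have to be tuned together rather than in isolation.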