Table of Contents
URL: https://www.progressiverobot.com/vllm-gpu-sizing-configuration-guide/
Introduction
Effectively sizing and configuring GPUs for vLLM inference starts with a clear understanding of the two fundamental phases of LLM processing, Prefill and Decode, and how each places different demands on your hardware.
This guide explains the anatomy of vLLM's runtime behavior, clarifies core concepts like memory requirements, quantization, and tensor parallelism, and provides practical strategies for matching GPU selection to real-world workloads. By exploring how these factors interact, you'll gain the knowledge needed to anticipate performance bottlenecks and make informed, cost-effective decisions when deploying large language models on GPU infrastructure.
Key Takeaways
- Prefill vs. Decode phases determine hardware needs: The prefill phase (processing input prompts) is memory-bandwidth bound and affects Time-To-First-Token, while the decode phase (generating outputs) is compute-bound and determines token generation speed.
- VRAM capacity sets absolute limits: Model weights and KV cache must fit within available GPU memory. A 70B model in FP16 requires 140GB for weights alone, making quantization essential for single-GPU deployments.
- KV cache grows dynamically: Unlike static model weights, the KV cache expands based on context length and concurrency. A 70B model with 32k context and 10 concurrent users needs approximately 112GB for FP16 cache or 56GB for FP8 cache.
- Quantization is the primary optimization lever: Reducing precision from FP16 to INT4 cuts memory usage by 75%, enabling large models to run on smaller GPUs. FP8 quantization offers the best balance of speed and quality on modern hardware.
- Tensor Parallelism enables larger models: When models exceed single-GPU capacity, TP shards weights across multiple GPUs, pooling VRAM at the cost of communication overhead. Single-GPU execution is faster when feasible.
The Anatomy of vLLM Runtime Behavior: Prefill vs. Decode
The Prefill Phase ("Reading")
This is the very first step of any request. vLLM takes the entire input prompt (user query + system prompt + any RAG context) and processes it all at once in highly parallel fashion.
- What happens: The model "reads" the context and populates the Key-Value (KV) cache with the mathematical representation of that context.
- The Bottleneck: Because it processes thousands of tokens in parallel, this phase is almost always memory-bandwidth bound. The speed limit is how fast the GPU can move giant weight matrices from VRAM into the compute cores. For more on GPU performance characteristics, see our guide on GPU performance optimization.
- Real-World Impact: This determines Time-To-First-Token (TTFT). If you have a massive 100k token document to summarize, the prefill phase is what makes the user wait before the first word appears.
The Decode Phase ("Writing")
Once the prefill is complete, vLLM enters an autoregressive loop to generate the output.
- What happens: The model generates one token, appends it to the sequence, and then runs the entire model again to generate the next token. This is inherently sequential for a single request.
- The Challenge: Loading enormous model weights from VRAM just to calculate a *single* token for one user is extremely inefficient; the GPU spends more time moving data than calculating.
- The Solution (Continuous Batching): To solve this, modern engines like vLLM do not process requests one by one. Instead, they use continuous batching. Requests enter and leave the batch dynamically. vLLM interleaves the prefill operations of new requests with the decode steps of ongoing requests in the same GPU cycle.
- The Bottleneck: When effectively batched, this phase becomes compute-bound (limited by raw TFLOPS), as the goal is to crunch as many parallel token calculations as possible to maximize total throughput.
Learn More: For a deeper analysis of static vs. continuous batching trade-offs, refer to this article: 🔗 Hugging Face: LLM Performance – Prefill vs. Decode
Linking Phases to Workloads & Hardware
Understanding which phase dominates your workload is essential for selecting the right hardware.
| Runtime Phase | Primary Action | Primary Hardware Constraint | Dominant Use Cases |
|---|---|---|---|
| Prefill | Processing long inputs in parallel. | Memory Bandwidth (TB/s) *(Crucial for fast TTFT)* | • RAG (Retrieval-Augmented Generation) • Long Document Summarization • Massive Few-Shot Prompting |
| Decode | Generating outputs sequentially. | Compute (TFLOPS) *(Crucial for fast token generation)* | • Interactive Chat & Customer Service • Real-time Code Generation • Multi-turn Agentic Workflows |
KV Cache at Runtime
During inference, vLLM relies heavily on a KV cache to avoid recomputing work it has already done.
- The Mechanics: In a transformer, each token is turned into key (K) and value (V) vectors inside the attention layers. Without a cache, the model would have to reprocess the entire history (tokens 0 … *t*) just to generate token *t*+1.
- The Solution: The KV cache solves this by storing the K and V vectors once and reusing them.
- Prefill: vLLM computes K/V for all prompt tokens and stores them immediately.
- Decode: For each new token, it retrieves the history from the cache and only computes the K/V for the *new* token.
- The Benefit: This turns attention from effectively quadratic work (re-reading the whole book to write every word) into linear work (just writing the next word).
The Trade-off: Dynamic Memory Growth: The price of this speed is memory. Every new token generated appends more entries to the cache. At runtime, KV cache usage grows dynamically based on:
- Prompt Length & Output Length: Longer conversations consume more VRAM.
- Concurrency: Each active request needs its own isolated cache.
- Model Size: Deeper models (more layers) and wider models (more heads) require larger caches per token.
Scaling Impact: This behavior is why two workloads using the same model can have vastly different hardware requirements. A 70B model might fit on a GPU, but if the KV cache grows too large during a long conversation, the server will run out of VRAM and crash. Understanding memory management is essential for production deployments, as covered in our fine-tuning LLMs guide.
Sizing Fundamentals: How Models, Precision, and Hardware Determine Fit
Once we understand how vLLM behaves at runtime, the next step is determining whether a model can run on a given GPU and what level of concurrency or context length it can support.
This section provides the mathematical formulas and decision trees needed to calculate static memory requirements, estimate KV cache growth, and systematically troubleshoot fit issues.
GPU Hardware Characteristics & Constraints
Before calculating model size, it is essential to understand the "container" we are trying to fit into. Different GPUs impose different hard limits on feasibility and performance.
Common Data Center GPU VRAM Capacities- These are the hard memory limits for the most common inference GPUs.
GPU Comparison for vLLM Inference and Training
| GPU Model | VRAM Capacity | Peak Dense TFLOPS (FP16 / FP8) | Primary Applications & Advantages |
|---|---|---|---|
| NVIDIA L40S | 48 GB | 362 / 733 | Cost-Effective Inference: Optimal for small-to-medium quantized models (7B–70B). |
| NVIDIA A100 | 40 GB / 80 GB | 312 / N/A | Previous Standard: 80GB version is excellent for tasks demanding high memory bandwidth. |
| NVIDIA H100 | 80 GB | 989 / 1,979 | Current High-End Standard: Features massive bandwidth, ideal for applications requiring long context lengths. |
| NVIDIA H200 | 141 GB | 989 / 1,979 | Significant Performance Boost: Allows for larger batch sizes or running 70B+ models with fewer required GPUs. |
| NVIDIA B300 | 288 GB | \~2,250 / 4,500 | Ultimate Density: Capable of fitting massive models (e.g., Llama 405B) with minimal GPU parallelism. |
| AMD MI300X | 192 GB | 1,307 / 2,614 | Massive Capacity: Perfectly suited for very large, unquantized models or processing huge batch sizes. |
| AMD MI325X | 256 GB | 1,307 / 2,614 | Capacity Optimized: Excellent choice for serving 70B+ models, especially those with very long context requirements. |
| AMD MI350X | 288 GB | 2,300 / 4,600 | High-Performance Flagship: Direct competitor to the B300, designed for massive-scale workloads. |
Even if a model fits in VRAM, the specific GPU architecture significantly impacts vLLM performance. Key metrics to consider are:
| Metric | Measured In | Impact on vLLM |
|---|---|---|
| VRAM Capacity | GB | Can it run? This sets the absolute maximum limit for the model size and context window. |
| Memory Bandwidth | TB/s | Prefill Speed. Crucial for RAG (Retrieval-Augmented Generation) and long-context summaries. High bandwidth ensures a fast Time-To-First-Token. |
| Compute (TFLOPS) | TFLOPS | Decode Speed. Essential for chat applications. High TFLOPS result in fast tokens/sec generation. |
| Interconnect | GB/s | Parallelism Cost. Any interconnect adds latency. Even with NVLink (the cloud provider's standard), Tensor Parallelism (TP) introduces synchronization overhead that reduces performance compared to single-GPU execution. |
Model Weight Footprint (Static Memory)
Every model must load its weights into GPU VRAM before vLLM can serve requests. The size of the weights depends entirely on the number of parameters and the precision chosen.
Formula for Static Weights
The estimated VRAM requirement (in GB) for a model can be calculated using the formula
**VRAM (GB) ≈ Parameters (Billions) × Bytes per Parameter
The table below illustrates the VRAM calculation for a Llama 3.1 70 Billion Parameter model at various quantization precisions
| Precision | Bytes per Parameter | Example: Llama 3.1 70B VRAM (GB) |
|---|---|---|
| FP16 / BF16 | 2 bytes | 70 x 2 \= 140GB |
| FP8 / INT8 | 1 byte | 70 x 1 \= 70GB |
| INT4 | 0.5 bytes | 70 x 0.5 \= 35GB |
Precision choice is the single biggest lever for feasibility. Quantizing a 70B model from FP16 to INT4 reduces its static footprint by 75%, moving it from "impossible on a single node" to "fits on a single A100". This makes quantization essential for cost-effective deployments on GPU cloud servers.
KV Cache Requirements (Dynamic Memory)
While weights determine if a model can start, the KV cache determines if it can scale. It is easy to underestimate the KV cache, leading to OOMs under load.
To accurately size a deployment, you need to estimate how much memory the cache will consume based on the expected context length and concurrency.
The "Field" Rule of Thumb (Quick Estimation)
For most customer conversations, the exact formula is too complex to calculate on the fly. Instead, use a "Per-Token" memory multiplier. This method gets you close enough for initial sizing decisions.
Simplified KV Cache Formula:
Total KV Cache (MB) = Total Tokens x Multiplier
(Where Total Tokens = Context Length x Concurrency)
Standard Multipliers:
| Model Size | Standard Multiplier (FP16 Cache) | Quantized Multiplier (FP8 Cache) |
|---|---|---|
| Small Models (7B – 14B) | 0.15 MB / token | 0.075 MB / token |
| Large Models (70B – 80B) | 0.35 MB / token | 0.175 MB / token |
Example Calculation:
A customer wants to run Llama 3 70B with a context of 32k and 10 concurrent users.
32,000 x 10 \= 320,000 tokens
320,000 x 0.35 MB \= 112,000 MB 112 GB
- Calculate Total Tokens:
- Apply Standard Multiplier (0.35):
- Check FP8 Option: If they enable FP8 cache, it drops to half: \~56 GB.
Verdict:
- FP16 Cache: 112 GB Cache + 140 GB Weights = 252 GB Total (Requires 4x H100s).
- FP8 Cache: 56 GB Cache + 140 GB Weights = 196 GB Total (Fits on 3x H100s, or tight on 2x H100s if weights are also quantized).
Exact Calculation & Tools
For detailed validation or corner cases, use the formal formula or an online calculator.
- Online Tool: LMCache KV Calculator
- Formal Formula:
Total KV Cache (GB) = (2 × n_layers × d_model × n_seq_len × n_batch × precision_bytes) / 1024^3
When Tensor Parallelism Is Required
Tensor Parallelism (TP) is a technique that shards (splits) a model's individual weight matrices across multiple GPUs. Effectively, it allows vLLM to treat multiple GPUs as a single, massive device with pooled VRAM.
Why use it? TP is primarily a requirement for feasibility, not a performance optimizer. You typically enable it when:
- Weights don't fit: The model is simply too big for one GPU (e.g., Llama 3 70B on a 24GB card).
- KV Cache limitations: The model fits, but leaves zero room for the KV cache, preventing you from running long contexts or high concurrency.
The "Tax" of Parallelism (Performance Impact) While TP unlocks massive memory, it introduces communication overhead. After every layer of computation, all GPUs must synchronize their partial results.
- If a model fits on one GPU: Running it on one GPU is almost always faster than running it on two. The single GPU has zero communication overhead.
- Interconnect Dependency: TP relies heavily on fast GPU-to-GPU bandwidth. If you use TP on cards without NVLink (e.g., over standard PCIe), inference speed can drop significantly due to synchronization latency. For deploying multi-GPU setups, consider using Kubernetes to orchestrate your vLLM deployments.
Learn More: For a deeper look at how Tensor Parallelism shards models and impacts latency, refer to this conceptual guide: 🔗 Hugging Face: Tensor Parallelism Concepts
Putting the Numbers to the Test: Sizing Scenarios
Before moving to advanced configurations, let's apply the math from the previous sections to real-world scenarios. This helps verify our understanding of "Fit" and introduces the practical constraints often missed in pure math.
The "Hidden" VRAM Tax
It is a common mistake to calculate Weights + Cache = Total VRAM and assume 100% utilization is possible. It is not.
- CUDA Context & Runtime: The GPU driver, PyTorch, and vLLM runtime reserve memory just to initialize (typically 2–4 GB).
- Activation Buffers: Temporary storage for intermediate calculations during the forward pass.
- Safe Sizing Rule: Always reserve \~4-5 GB of VRAM as "unusable" overhead. If your calculation leaves only 0.5 GB free, the server will likely crash.
Scenario A: The Easy Fit (Standard Chat)
- Hardware: 1x NVIDIA L40S (48 GB)
- Model: Llama 3 8B (FP16)
- Math:
- Weights:
8B params x 2 bytes = 16 GB - Overhead:
-4 GB - Remaining for Cache:
48 - 16 - 4 = 28 GB - Cache Capacity:
28,000 MB / 0.15 MB per token = 186,000 tokens.- Verdict: Excellent Fit.
- This setup can handle massive loads (e.g., 60 concurrent users with 3k context each).
- *Result:* High throughput, low cost.
Scenario B: The "Weight Failure" (Large Model, Single GPU)
- Hardware: 1x NVIDIA H100 (80 GB)
- Model: Llama 3 70B (FP16)
- Math:
- Weights:
70B params x 2 bytes = 140 GB - Verdict: Hard Fail.
- The weights (140 GB) physically exceed the GPU capacity (80 GB).
- *Solution:* You must use Tensor Parallelism (2x GPUs) or Quantization (Section 3).
Scenario C: The "Cache Trap" (Fits, but cannot run)
- Hardware: 1x NVIDIA H100 (80 GB)
- Model: Llama 3 70B (FP8 Quantized)
- Math:
- Weights:
70B params x 1 byte = 70 GB - Overhead: -4 GB
- Remaining for Cache:
80 - 70 - 4 = 6 GB - Cache Capacity:
6,000 / 0.175 MB per token (FP8) = 34,000 tokens total.- Verdict: Risky / Poor Fit.
- The model *loads*, but you have almost no room to work.
- Impact: With 10 concurrent users, each gets only 3.4k context. If a user pastes a long document (4k tokens), the system will OOM.
- *Lesson:* Just because the weights fit, doesn't mean the workload fits. This scenario usually requires a second GPU or a smaller model.
Scenario D: The Solution (Tensor Parallelism)
Let's fix the "Cache Trap" from Scenario C by adding a second GPU. This demonstrates how Tensor Parallelism (TP) pools memory resources.
- Hardware: 2x NVIDIA H100 (80 GB each \= 160 GB Total Available)
- Model: Llama 3 70B (FP8 Quantized)
- Math:
- Total VRAM: 160 GB
- Weights: -70 GB (Spread across both GPUs)
- Overhead: -8 GB (~4 GB per GPU)
- Remaining for Cache:
160 - 70 - 8 = 82 GB - Cache Capacity:
82,000 / 0.175 MB per token (FP8) = 468,000 tokens total.- Verdict: Production Ready.
- By adding a second GPU, we moved from a "risky" 6 GB of cache space to a massive 82 GB.
- Impact: With 10 concurrent users, each now gets \~46k context. The OOM risk is gone, and the deployment can handle RAG or long documents easily.
Quantization: The Art of "Squeezing" Models
As demonstrated in the previous sizing scenarios, VRAM is the primary bottleneck for LLM inference. Quantization is the technique of reducing the precision of the numbers used to represent data, effectively trading a tiny amount of accuracy for massive gains in memory efficiency and speed.
It is critical to distinguish between the two types of quantization used in vLLM, as they address different constraints.
Type 1: Model Weight Quantization (The "Static" Fix)
This involves compressing the massive, static weight matrices of the pre-trained model before it is loaded.
- Goal: Make a model fit onto a GPU where its full-precision weights would exceed VRAM capacity.
- vLLM Implementation: While vLLM can quantize weights dynamically at startup, it is generally more efficient to load a model that has already been quantized using high-performance kernels like AWQ (Activation-aware Weight Quantization) or GPTQ. These specialized formats offer better accuracy retention and faster decode speeds than generic on-the-fly conversion.
- Impact: Reduces static VRAM usage by 50% (FP8/INT8) to 75% (INT4/AWQ), significantly increasing the VRAM remaining for the KV cache.
Type 2: KV Cache Quantization (The "Dynamic" Fix)
This involves compressing the intermediate Key and Value states stored in memory during sequence generation.
- Goal: Make a model scale to support higher concurrent batches or longer context windows.
- vLLM Implementation: This is controlled via a runtime flag (
--kv-cache-dtype). - Recommendation: On modern hardware supporting FP8 tensor cores (e.g., NVIDIA H100, L40S, AMD MI300X), enabling FP8 KV Cache (
fp8) is highly recommended. It effectively doubles the available context capacity with negligible impact on model quality. - Impact: Halves the per-token memory requirement discussed in Section 2 (e.g., reducing the multiplier for a 70B model from ~0.35 MB/token to ~0.175 MB/token).
vLLM GPU Precision Formats
Not all quantization formats are created equal. The choice depends on the available hardware architecture and the desired balance between model size and accuracy.
| Precision / Format | Bytes per Param | Accuracy Impact | Best Hardware Support | Recommended Use Case |
|---|---|---|---|---|
| FP16 / BF16 (Base) | 2 | None (Reference) | All modern GPUs | The Gold Standard. Use whenever VRAM capacity permits. |
| FP8 (Floating Point 8) | 1 | Negligible | H100, H200, L40S, MI300X | Modern Default. The best balance of speed and quality on new hardware. Ideal for KV cache. |
| AWQ / GPTQ (INT4 variants) | ~0.5 | Low/Medium | A100, L40S, Consumer | The "Squeeze" Option. Essential for fitting huge models on older or smaller GPUs. Excellent decode speed. |
| Generic INT8 | 1 | Medium | Older GPUs (V100, T4) | Legacy. Generally superseded by FP8 on newer hardware or AWQ for extreme compression. |
Strategic Application & Trade-offs
Deciding when to apply quantization requires balancing practical constraints against workload sensitivity. While powerful, quantization involves fundamental trade-offs that must be considered during deployment planning.
Key Considerations: Accuracy & Hardware
Before determining the scenario, consider these two foundational constraints:
- Accuracy vs. Compression: Aggressive quantization (e.g., INT4) can degrade performance on sensitive tasks involving complex reasoning or code generation. FP8 is generally considered safe for most chat and RAG use cases.
- Hardware Compatibility: The chosen format must match the hardware's capabilities. For example, FP8 quantization requires GPUs with FP8 tensor cores (NVIDIA Ada/Hopper or AMD CDNA3 architectures) to realize performance gains.
When Quantization Should Be Recommended
Based on the trade-offs above, quantization is appropriate in a wide range of real-world scenarios and is often the default choice in enterprise environments:
- Large models that don’t fit in FP16: INT4/INT8 is often the only way to serve 70B-class models on single 48GB or 80GB GPUs.
- High concurrency workloads: Reduced VRAM usage frees up significant space for the KV cache, allowing for more active sequences and longer prompts.
- RAG and enterprise chat: These workloads typically tolerate small accuracy deviations well without impacting the final user experience.
- Cost-optimization efforts: Quantization allows workloads to run on smaller, less expensive GPU SKUs while maintaining acceptable performance. This is particularly valuable when deploying on GPU cloud servers, where you can balance performance and cost based on your specific requirements.
When Quantization Should Be Avoided
Quantization is not universally suitable. Some workloads are highly sensitive to precision loss and should remain in FP16/BF16 whenever possible:
- Code generation and debugging: Lower precision can degrade structured reasoning, leading to subtle syntax errors or logical flaws.
- Math, finance, and scientific queries: Tasks requiring exact calculations benefit significantly from higher precision formats to avoid rounding errors.
- Evaluation, benchmarking, or regression testing: Small accuracy drift can invalidate comparisons between model versions or setups.
- Agentic workflows with multi-step reasoning: Compounded errors across multiple steps can reduce the system’s overall reliability and task completion rate.
Putting It All Together: From Requirements to a Deployment Plan
Up to this point, we have covered vLLM runtime behavior (Section 1), memory fundamentals (Section 2), and quantization strategies (Section 3).
This section connects these concepts into a repeatable decision framework. It moves from theory → practice, providing a structured workflow to evaluate feasibility, select hardware, and build a deployment plan.
Step 1: The Sizing Questionnaire
To accurately size a vLLM deployment, you must extract specific technical details from the workload description. Abstract goals like "fast inference" are insufficient. Use these five questions to gather the necessary data:
- "What is the maximum context length you need to support?"
- *Why:* Determines KV cache size (and thus, OOM risk).
- "What is your target concurrency (simultaneous users)?"
- *Why:* Multiplies the KV cache requirement.
- "What is the acceptable latency (TTFT and Tokens/Sec)?"
- *Why:* Determines if you need high bandwidth (H100) or just decent capacity (L40S).
- "Is model accuracy critical (Math/Code), or is 'good enough' acceptable (Chat)?"
- *Why:* Determines if you can use Quantization (INT4/FP8) to save money.
- "Do you have a hard budget constraint?"
- *Why:* Helps decide between optimizing for maximum speed (H100) vs. price-performance (L40S).
Step 2: Select Model Size and Precision
Once requirements are clear, identify the smallest model and highest precision that meets the quality bar.
- Precision is the Lever: Lower precisions (INT4/FP8) make large models feasible on cheaper hardware.
- The "Rule of 70B": A 70B model in FP16 requires exotic hardware (multi-GPU). The same model in INT4 fits on commodity hardware (single GPU).
- Guidance:
- *Chat/Assistant:* Use INT4 or FP8.
- *Code/Reasoning:* Use FP16 or FP8 (Avoid INT4).
Step 3: Hardware Feasibility Check
Verify the fit using the math from Section 2.
- Static Fit (Weights): Does
Params * Precisionfit in VRAM?
- *If No:* Quantize (Step 2) or add GPUs (Tensor Parallelism).
- Dynamic Fit (Cache): Is there room for
Context * Concurrency * Multiplier?
- *If No:* Reduce concurrency, shorten context, or enable FP8 KV Cache.
- Workload Fit (Bandwidth):
- *Long RAG/Summarization:* Needs High Bandwidth (H100/A100).
- *Standard Chat:* Needs High Compute (L40S).
Step 4: Recommend the GPU Strategy
With feasibility confirmed, select the specific GPU SKU. Use this "Cheat Sheet" for common scenarios.
Common Configuration Outcomes
| Workload Scenario | Recommended Configuration | Rationale |
|---|---|---|
| Standard Chat (8B–14B) | NVIDIA L40S (48GB) | Best Value. Massive compute for decode; 48GB easily fits small models + huge cache. |
| Large Chat (70B, Cost-Sensitive) | L40S (INT4) or A100 (INT4) | The "Squeeze." Quantization allows fitting 70B on a single card, avoiding the complexity of multi-GPU setups. |
| High-Performance Chat (70B) | NVIDIA H100 (FP8) AMD MI300X (FP16/FP8) | Modern Standard. H100 uses FP8 to fit and speed up inference. AMD Advantage: MI300X's 192GB VRAM allows running 70B models with massive batch sizes on a single card. |
| Massive Context / RAG | NVIDIA H200 AMD MI300X / MI325X | Bandwidth & Capacity Kings. AMD: With 192GB (MI300X) or 256GB (MI325X), these are the best options for extreme context (128k+) without needing 4-8 GPUs. |
| Uncompromised Quality (70B FP16) | 2x H100 (Tensor Parallelism) 1x AMD MI300X | The "Single-Card" Hero. NVIDIA: Requires 2 cards to fit 140GB of weights. AMD: Fits the full 70B FP16 model on a single GPU, eliminating the latency penalty of Tensor Parallelism. |
| Ultra-Scale / Next-Gen (405B+) | NVIDIA B300 AMD MI350X | The Frontier. Designed for massive model density. MI350X (288GB) rivals NVIDIA's Blackwell generation for fitting 400B+ MoE models efficiently. |
Step 5: Validate with Metrics
No paper sizing is perfect.
- Check TTFT: If high, your prefill is bottlenecked (bandwidth saturation).
- Check Inter-Token Latency: If high, your batch size might be too aggressive (compute saturation).
- Check KV Cache Usage: If consistently >90%, you are at risk of OOM and should enable Chunked Prefill or reduce concurrency.
Frequently Asked Questions
GPU memory requirements depend on model size, precision, context length, and concurrency. As a rule of thumb, FP16 models need approximately 2GB per billion parameters for weights alone. A 70B model requires 140GB for FP16 weights, but only 35GB with INT4 quantization. Additionally, you must account for KV cache memory, which grows with context length and concurrent users. For a 70B model with 32k context and 10 users, expect approximately 112GB for FP16 cache or 56GB for FP8 cache.
Tensor parallelism (TP) shards model weight matrices across multiple GPUs within each layer, allowing multiple GPUs to work on the same computation simultaneously. This pools VRAM but requires synchronization after each layer, adding communication overhead. Pipeline parallelism (PP) distributes model layers across GPUs sequentially, with each GPU processing different layers. TP is typically used when a model is too large for a single GPU, while PP is more common in training scenarios. For inference, TP is the standard approach when models exceed single-GPU capacity.
Quantization is recommended when models don't fit in available VRAM, when you need to support higher concurrency or longer context windows, or when cost optimization is a priority. FP8 quantization is ideal for modern hardware (H100, L40S, MI300X) and offers minimal accuracy loss. INT4 quantization is necessary for fitting large models on smaller GPUs but should be avoided for code generation, math, and scientific tasks where precision matters. For chat and RAG workloads, quantization is often the default choice.
Use the per-token multiplier method for quick estimation: multiply total tokens (context length × concurrency) by the model-specific multiplier. For small models (7B-14B), use 0.15 MB per token for FP16 cache or 0.075 MB for FP8 cache. For large models (70B-80B), use 0.35 MB per token for FP16 cache or 0.175 MB for FP8 cache. For exact calculations, use the formula: Total KV Cache (GB) = (2 × n_layers × d_model × n_seq_len × n_batch × precision_bytes) / 1024³, or use online tools like the LMCache KV Calculator.
Yes, vLLM) can be deployed on GPU cloud servers. the cloud provider offers GPU Droplets with NVIDIA GPUs that support vLLM's requirements. When deploying, ensure your selected GPU has sufficient VRAM for your model size and workload. For cost-effective deployments, consider using quantized models (INT4 or FP8) to fit larger models on smaller GPU instances. the cloud provider's GPU Droplets provide NVLink connectivity, which is essential for efficient tensor parallelism when using multiple GPUs.
Practical Use-Cases of vLLM GPU Inference
Building on the foundational understanding of how factors like model size, precision, GPU architecture, KV cache, and batching influence performance, in the upcoming tutorials, we will apply these concepts to practical vLLM workloads.
For each use case, we will address three key questions to determine the optimal setup:
- Workload Definition: What are the defining characteristics? (e.g., prompt vs. output length, concurrency, latency sensitivity).
- Sizing Priorities: What factors are the bottleneck? (e.g., weights vs. KV cache, bandwidth vs. compute).
- Configuration Pattern: Which specific flags and hardware choices reliably perform well?
Use Case 1: Interactive Chat & Assistants
- Focus: Low Latency (Decode-Bound).
- Goal: Smooth streaming and fast "typing speed" for users.
- Key Constraint: Compute (TFLOPS) and Inter-Token Latency.
Use Case 2: High-Volume Batch Processing
- Focus: Max Throughput (Compute-Bound).
- Goal: Process millions of tokens per hour for offline jobs (e.g., summarization).
- Key Constraint: Total System Throughput (Tokens/Sec).
Use Case 3: RAG & Long-Context Reasoning
- Focus: Context Capacity (Memory-Bound).
- Goal: Fitting massive documents or history into memory without crashing.
- Key Constraint: VRAM Capacity and Memory Bandwidth (Prefill Speed).
Conclusion
Properly sizing and configuring GPUs for vLLM requires understanding the fundamental trade-offs between model size, precision, context length, and concurrency. The prefill and decode phases have different hardware requirements, with prefill demanding high memory bandwidth and decode requiring high compute throughput. Quantization serves as the primary lever for fitting large models on available hardware, while tensor parallelism enables scaling beyond single-GPU limits.
The key to successful deployment is matching your workload characteristics to the right hardware configuration. Interactive chat applications prioritize compute for fast token generation, while RAG and long-context workloads need massive VRAM capacity and high memory bandwidth. By following the sizing framework outlined in this guide, you can systematically evaluate feasibility, select appropriate hardware, and optimize your vLLM deployment for production workloads.
Next Steps
Ready to deploy vLLM on GPU infrastructure? Explore these resources to get started:
- Deploy on GPU cloud servers: Get hands-on experience with vLLM by deploying on the cloud provider's GPU Droplets. Learn how to set up your environment and configure vLLM for optimal performance in our GPU Droplets documentation.
- Related Tutorials: Deepen your understanding of LLM deployment and optimization:
- Run GPT-OSS vLLM on AMD GPU Droplet
- Learn about fine-tuning LLMs on a budget using GPU cloud servers
- Explore GPU performance optimization techniques for deep learning workloads
- Understand container orchestration for scaling LLM deployments
- Build a RAG application using GPU Droplets
- Try the cloud provider Products:
- GPU Droplets: High-performance GPU instances for AI and machine learning workloads
- Kubernetes: Orchestrate and scale your vLLM deployments across multiple nodes
- an app platform: Deploy and manage your LLM applications with ease
For more technical guides and best practices, visit the the cloud provider Community Tutorials or explore our resources on AI and machine learning.