    Understanding GPU VRAM

    Kartik Jain · November 9, 2025 · 9 min read
    Tags: Machine Learning, GPU, VRAM, Deep Learning, AI, Training, Inference

    Imagine attempting to run a marathon in shoes two sizes too small — uncomfortable and inefficient. Similarly, deploying Large Language Models without adequate VRAM can lead to frustrating bottlenecks. Understanding VRAM requirements helps you choose the right GPU, optimize your models, and avoid out-of-memory errors (training typically requires 3–4× more VRAM than inference!).

    As LLMs grow in popularity, running them locally has become increasingly common, so it's important to understand the VRAM requirements for training and inference: having enough memory is the basic prerequisite for running any model at all.

    What is VRAM and Why Does It Matter?

    VRAM (Video Random Access Memory) is the dedicated memory built into your GPU, used to store data during computation. Just like RAM in your CPU, it temporarily holds the information your system is currently working on — but VRAM is specifically optimized for graphics and parallel data processing.

    So, why do we even need a GPU? Couldn’t a CPU handle the same tasks?

    The short answer: GPUs are built for speed through parallelism, while CPUs are built for versatility.

    A CPU has a few powerful cores that handle complex logic one task at a time.

    A GPU, on the other hand, has thousands of smaller cores that can process many simple operations simultaneously — perfect for the repetitive math in deep learning and graphics.

    During model training or inference, this difference becomes critical:

    • Using only a CPU means computations happen sequentially, so training that takes 1 hour on a GPU could take tens of hours—or even days—on a CPU.
    • Large models may not even fit in CPU memory efficiently, since GPUs have high-bandwidth VRAM designed to handle massive tensors (multi-dimensional arrays that represent data in deep learning models) in real time.
    • The CPU can easily become bottlenecked—spending more time moving data around than actually computing.

    Basically, a CPU can run your model, but it’s like teaching one person to fill a stadium with buckets of water — while a GPU gives you thousands of people doing it at once.

    That’s why VRAM and GPU architecture are essential: they keep data close, fast, and parallel, enabling models to train and infer efficiently without running out of memory or patience.

    In machine learning (a field of AI where models learn patterns from data to make predictions or generate new content), VRAM is critical because:

    • Model Storage: Your model's parameters (the learnable weights and biases that define its behavior) must fit inside VRAM
    • Data Loading: Input data batches are copied to VRAM during training and inference
    • Intermediate Results: Activations (temporary outputs from each layer), gradients (the derivatives of the loss used to adjust weights), and optimizer states all take space
    • Speed: VRAM is far faster than CPU memory or disk access

    Training vs. Inference: The Memory Difference

    One of the biggest distinctions in ML is that training requires significantly more VRAM than inference.

    Inference (Making Predictions)

    When running a trained model, you only store:

    • Model parameters (the weights and biases that define the model's behavior)
    • Input data
    • Intermediate activations (outputs from each layer as data flows through the network)

    Training (Learning from Data)

    Training adds much more memory pressure because it also stores:

    • Gradients: Derivatives for every parameter, telling the optimizer how to adjust each weight to reduce error
    • Optimizer States: The Adam optimizer keeps 2 additional copies of the parameters (momentum and variance)
    • Larger Batches: Training often uses bigger batch sizes for better stability
    Rule of Thumb: Training needs roughly 4-8× more VRAM than inference!

    Factors Involved in VRAM (GPU Memory) Calculation

    GPU VRAM usage in machine learning depends on several components. These can be grouped into five major categories that together determine how much memory your model consumes.

    1. Model Parameters (Weights + Biases)

    Model Parameters represent the core memory your model occupies — the learnable numbers (weights and biases) that define its behavior.

    Depends on:

    • Number of Parameters: Total learnable weights and biases (e.g., 110M for BERT Base, 7B for LLaMA 7B)
    • Precision / Data Type: Determines bytes per parameter. FP32 → 4 bytes, FP16/BF16 → 2 bytes, INT8 → 1 byte
    Model Memory (bytes) = Number of Parameters × Bytes per Parameter

    Example:

    LLaMA 7B in FP16 → 7,000,000,000 × 2 bytes = ~14 GB just for weights. (1 GB = ~10⁹ bytes)
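    To make the formula concrete, here is a minimal Python sketch (not from the original post; the function name and dtype table are my own) that computes weight memory from parameter count and precision, using 1 GB = 10⁹ bytes as in the example above.

```python
# Bytes per parameter for common precisions (INT4 would be 0.5).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def model_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory needed just for the weights, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(model_memory_gb(7e9, "fp16"))  # LLaMA 7B in FP16 -> 14.0 GB
print(model_memory_gb(7e9, "int8"))  # same model quantized to INT8 -> 7.0 GB
```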

    2. Activations (Intermediate Outputs)

    Activations are the temporary outputs produced during the forward pass — and reused in the backward pass during training.

    Depends on:

    • Batch Size: Number of samples processed simultaneously. Larger = more activations.
    • Sequence Length / Resolution: Input size (tokens per sequence or image pixels). Larger = higher memory use.
    • Hidden Size: Width of each layer (e.g., 768 for BERT, 4096 for GPT).
    • Number of Layers: Deeper networks require more activation storage.
    • Precision: FP32 uses more memory than FP16 or BF16.
    Activation Memory ∝ Batch Size × Sequence Length × Hidden Size × Layers
    Rule of Thumb: Activations often take as much or more memory than model parameters — especially during training.
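    As a rough sketch of that proportionality (my own illustration, not an exact formula), everything architecture-specific can be lumped into one tunable factor; a factor of about 4 happens to land near the ~3 GB activation estimate used for the GPT-2 Medium training example later in this post.

```python
def activation_memory_gb(batch_size, seq_len, hidden_size, num_layers,
                         bytes_per_value=2, act_factor=4):
    """Very rough activation estimate: implementation details (attention maps,
    MLP width, fused kernels) are folded into act_factor, which is an
    assumption to tune, not a fixed constant."""
    values = batch_size * seq_len * hidden_size * num_layers * act_factor
    return values * bytes_per_value / 1e9

# GPT-2 Medium-like shape: batch 8, seq 1024, hidden 1024, 24 layers, FP32
print(activation_memory_gb(8, 1024, 1024, 24, bytes_per_value=4))  # ~3.2 GB
```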

    3. Gradients

    Gradients are stored only during training — they represent how much each weight should change to reduce loss.

    Each parameter has a corresponding gradient (same size, same precision). So, if model weights occupy 4 GB, gradients typically consume another ~4 GB.

    Gradient Memory ≈ Model Memory

    4. Optimizer States

    Optimizer States store extra data used by training algorithms like Adam or RMSProp (e.g., momentum and variance).
    • SGD (with momentum): ~1× model size
    • Adam / AdamW: ~2× model size (momentum + variance)
    • Adagrad / RMSProp: ~2× model size
    • 8-bit Optimizers (bitsandbytes): ~0.25× model size
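    Expressed as code, these overheads boil down to a multiplier on model memory (a small sketch with my own names; exact numbers depend on the implementation):

```python
OPTIMIZER_STATE_MULTIPLIER = {
    "sgd_momentum": 1.0,  # one momentum buffer per parameter
    "adam": 2.0,          # momentum + variance
    "adamw": 2.0,
    "rmsprop": 2.0,
    "adam_8bit": 0.25,    # bitsandbytes-style 8-bit states
}

def optimizer_memory_gb(model_gb: float, optimizer: str = "adam") -> float:
    return model_gb * OPTIMIZER_STATE_MULTIPLIER[optimizer]

print(optimizer_memory_gb(1.32, "adam"))  # GPT-2 Medium in FP32 -> ~2.64 GB
```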

    5. Framework & GPU Overheads

    Framework Overhead includes everything not directly related to your model tensors.
    • CUDA kernels and buffers
    • cuDNN workspace
    • Framework runtime cache (PyTorch / TensorFlow)
    • Temporary tensors (reshaping, attention cache, etc.)
    💡 Usually adds 1–2 GB or about 10–20% extra VRAM beyond model usage.

    Putting It All Together

    Inference VRAM

    Inference VRAM = Model Memory + Activation Memory + Overhead

    Where:

    • Model Memory: Size of model parameters (calculated above)
    • Activation Memory: Temporary outputs at each layer (depends on batch size and, for text models, sequence length)
    • Overhead: Framework overhead, CUDA kernels, buffers (~1–2 GB)

    Typically about 1.2× Model Memory.

    Example: LLaMA 7B Inference

    Model Specifications

    • Parameters: 7 billion
    • Precision: FP16 (2 bytes)
    • Batch size: 1
    • Sequence length: 2048

    Calculation

    Model Memory = (7,000,000,000 × 2) / 1,073,741,824 = 14,000,000,000 / 1,073,741,824 ≈ 13 GB
    Activation Memory ≈ 2 GB
    Overhead ≈ 1 GB
    Total VRAM ≈ 16 GB
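    The same arithmetic as a small Python sketch (names and the activation/overhead defaults are my own assumptions; GB here means GiB, 1,073,741,824 bytes, to match the worked example):

```python
GIB = 1024**3

def inference_vram_gb(num_params, bytes_per_param,
                      activation_gb=2.0, overhead_gb=1.0):
    model_gb = num_params * bytes_per_param / GIB  # weights only
    return model_gb + activation_gb + overhead_gb

# LLaMA 7B, FP16, batch 1, sequence length 2048
print(inference_vram_gb(7e9, 2))  # ~16 GB
```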

    Training VRAM

    Training VRAM = Model Memory + Gradients + Optimizer States + Activation Memory + Overhead

    Where:

    • Model Memory: Same as inference
    • Gradients: Same size as the model (1× model memory) — stores the derivatives for backpropagation
    • Optimizer States: For the Adam optimizer, 2× model memory (momentum and variance: running averages of the gradients and the squared gradients)
    • Activations: Depends on batch size, sequence length, and model architecture
    • Overhead: Framework overhead (~1-2 GB)

    Or approximately:

    Training VRAM ≈ Model Memory × (3–4) + Activations + Overhead

    Example: GPT-2 Medium Training

    Model Specifications

    • Parameters: 355 million
    • Precision: FP32 (4 bytes)
    • Batch size: 8
    • Sequence length: 1024
    • Optimizer: Adam

    Calculation

    Model Memory = (355,000,000 × 4) / 1,073,741,824 ≈ 1.32 GB
    Gradients = 1.32 GB (same as model)
    Optimizer States = 2.64 GB (2× model for Adam)
    Activations ≈ 3 GB (larger due to batch size)
    Overhead ≈ 1 GB
    Total VRAM ≈ 9.3 GB
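    And the training version of the same back-of-the-envelope calculation (again a sketch with my own names; gradients are assumed to match model memory and Adam adds 2× for its momentum and variance states, as described above):

```python
GIB = 1024**3

def training_vram_gb(num_params, bytes_per_param,
                     optimizer_multiplier=2.0,  # Adam: momentum + variance
                     activation_gb=3.0, overhead_gb=1.0):
    model_gb = num_params * bytes_per_param / GIB
    gradient_gb = model_gb                      # same size and precision as the weights
    optimizer_gb = model_gb * optimizer_multiplier
    return model_gb + gradient_gb + optimizer_gb + activation_gb + overhead_gb

# GPT-2 Medium, FP32, batch 8, seq 1024, Adam
print(training_vram_gb(355e6, 4))  # ~9.3 GB
```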

    Quick Reference Formula

    VRAM_train ≈ (Model Params × Bytes) × M + A + O

    • M: Multiplier covering weights, gradients, and optimizer states (typically 3–4)
    • A: Activation memory (typically 0.5–2× model memory)
    • O: Framework overhead (typically 1–2 GB)
    💡 Pro Tip: The more layers, batch size, and higher precision you use, the faster VRAM fills up. Always validate with nvidia-smi or framework profiling tools to see real memory usage.
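    For example, PyTorch exposes counters you can read around a forward/backward pass (a minimal sketch; nvidia-smi reports roughly the same numbers at the process level):

```python
import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... run a forward/backward pass or a generation step here ...
    allocated = torch.cuda.memory_allocated() / 1024**3    # tensors currently alive
    reserved = torch.cuda.memory_reserved() / 1024**3      # held by the caching allocator
    peak = torch.cuda.max_memory_allocated() / 1024**3     # high-water mark since reset
    print(f"allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB, peak {peak:.2f} GiB")
```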

    Memory Optimization Strategies

    When you're running out of VRAM, here are proven techniques to reduce memory usage:

    • Reduce Precision (FP16/BF16): Use FP16 (16-bit floats) or INT8 (8-bit integer quantization) instead of FP32 to cut memory by 50–75%.
    • Gradient Checkpointing: Trade computation for memory by recomputing activations during the backward pass instead of storing them.
    • Model Quantization (INT8/INT4): Convert trained models to INT8 for inference, achieving roughly 4× memory reduction with minimal accuracy loss.
    • Smaller Batch Size: Reduce the batch size to lower activation memory (may affect training stability).
    • Model Parallelism: Split your model across multiple GPUs when it doesn't fit on a single device.
    • Offloading (CPU/NVMe): Move inactive tensors to CPU RAM or NVMe storage to free GPU VRAM dynamically.
    • Gradient Accumulation: Simulate larger batches by accumulating gradients over multiple small batches, as in the sketch below.
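    As an illustration of the last point, here is a minimal gradient-accumulation sketch in PyTorch (the toy model and random data are placeholders of my own): four micro-batches of 2 update the weights like one batch of 8, while activation memory stays at micro-batch size.

```python
import torch
from torch import nn

model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
accumulation_steps = 4

optimizer.zero_grad()
for step in range(8):                             # 8 micro-batches of size 2
    inputs, targets = torch.randn(2, 16), torch.randn(2, 1)
    loss = loss_fn(model(inputs), targets) / accumulation_steps  # scale so the summed gradient matches a full batch
    loss.backward()                               # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:      # update once per "virtual" batch of 8
        optimizer.step()
        optimizer.zero_grad()
```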

