How much VRAM does a local AI model need?

Learn how parameter count, quantization, context length and CPU offloading affect local-model memory use, with practical ranges for choosing a GPU.

By TreffikAI Editorial | June 12, 2026 | 4 min read

[Image: Graphics card and AI models with different memory requirements]

“How much VRAM does this model need?” sounds like a simple question, but parameter count alone cannot answer it. Memory use also depends on weight precision, quantization, context length, model architecture and whether some layers are offloaded to system RAM.

The numbers below are planning ranges, not guarantees. Models with similar parameter counts can behave differently, so check the exact model file and runtime benchmarks before buying hardware.

What consumes memory

During inference, memory is used mainly by:

model weights, the largest fixed component,
KV cache, which grows with context length and concurrent sessions,
compute buffers, which depend on the runtime and hardware,
the operating system and other applications, which is why the model never gets every advertised gigabyte.

The model file size is a useful first estimate, but loading and using it normally requires additional headroom.
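As a rough illustration, the components listed above can be added up and compared with the memory that is actually free. This is a minimal sketch, not a measurement tool: the names `MemoryEstimate` and `fits`, the 15% headroom default and every example figure are assumptions chosen for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class MemoryEstimate:
    """Rough inference memory budget; all fields are in bytes."""
    weights: int          # approximately the size of the model file
    kv_cache: int         # grows with context length and sessions
    compute_buffers: int  # runtime-dependent scratch space (assumed value)

    @property
    def total(self) -> int:
        return self.weights + self.kv_cache + self.compute_buffers


def fits(estimate: MemoryEstimate, free_vram: int, headroom: float = 0.15) -> bool:
    """Return True if the estimate fits after reserving a headroom fraction."""
    usable = free_vram * (1.0 - headroom)
    return estimate.total <= usable


# Hypothetical example: a 4.7 GiB model file on a card with 7.2 GiB free.
example = MemoryEstimate(
    weights=int(4.7 * GIB),
    kv_cache=1 * GIB,
    compute_buffers=int(0.5 * GIB),
)
print(f"estimated total: {example.total / GIB:.1f} GiB")
print("fits with 15% headroom:", fits(example, free_vram=int(7.2 * GIB)))
```

The point of the sketch is that the file size is only one term in the sum; the cache and buffer figures have to come from the actual runtime.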

Approximate requirements

Model size (parameters)	Typical 4-bit model size	Comfortable VRAM range
2–4B	about 2–3 GB	4–6 GB
7–8B	about 4–6 GB	8 GB
12–14B	about 8–10 GB	12–16 GB
27–32B	about 16–22 GB	24 GB or more
70B	about 40 GB and above	multiple GPUs or offloading

“Comfortable” includes some room for context and runtime buffers. A model may start with less VRAM by offloading layers to system memory, usually at the cost of speed.

Why quantization changes the calculation

Weights stored at 16-bit precision take two bytes per parameter. Quantization stores them with fewer bits, commonly eight or four, which makes a model easier to fit on consumer hardware.
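The arithmetic behind this is simple enough to sketch. The helper `weight_bytes` below is a hypothetical name, and the calculation is idealized: real quantized files are usually somewhat larger, because quantization formats also store scaling data and may keep some tensors at higher precision.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Idealized weight storage: parameters times bits, converted to bytes."""
    return n_params * bits_per_weight / 8


GB = 1e9  # decimal gigabytes, as used for file sizes

# An 8-billion-parameter model at three common precisions.
for bits in (16, 8, 4):
    size_gb = weight_bytes(8e9, bits) / GB
    print(f"8B parameters at {bits}-bit: about {size_gb:.0f} GB")
```

Halving the bits per weight halves the weight storage, which is why a 4-bit file is roughly a quarter of the 16-bit original.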

The benefits include:

smaller downloads,
lower VRAM use,
access to larger models,
often faster loading.

The tradeoff can be lower quality, especially with aggressive quantization or tasks requiring precise numerical reasoning. Different methods do not behave identically. The Hugging Face bitsandbytes documentation explains widely used 8-bit and 4-bit approaches.

Context can break a simple estimate

A model that fits during a short conversation may run out of memory once its context window is increased. The KV cache stores the attention keys and values for every token already processed, so it grows with the number of tokens in context, including both the prompt and the generated output.
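A common back-of-the-envelope estimate for the cache multiplies the number of layers, key/value heads, head dimension, bytes per value and tokens, then doubles the result for keys plus values. The sketch below assumes a hypothetical configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit cache values); the real figures come from the model's own configuration file, and some runtimes quantize or otherwise compress the cache.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 16-bit cache values (assumption)
) -> int:
    """Estimate KV cache size for one session at a given context length."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Hypothetical model configuration, used only for illustration.
for ctx in (2048, 8192, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=ctx)
    print(f"context {ctx:>6} tokens: {size / GIB:.2f} GiB of KV cache")
```

Under these assumptions the cache grows in direct proportion to context length, so a sixteen-fold larger window needs sixteen times the cache.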

This matters particularly for:

long-document analysis,
coding assistants,
multi-step agents,
extended chat sessions,
concurrent users.

Ollama's context documentation explicitly notes that larger contexts require more memory. Hardware tests should therefore use the context length expected in the real application.

VRAM versus system RAM

When a model does not fit entirely in GPU memory, some layers can be offloaded to system RAM. This may allow a larger model to run, but data transfer and computation outside the GPU reduce performance.
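To get a feel for how many layers can stay on the GPU, one rough approach is to divide the weights evenly across layers and see how many fit in the remaining budget. This is a simplified sketch: `gpu_layers_that_fit` is a hypothetical helper, the even split ignores embeddings and output layers, and the example sizes are assumptions.

```python
import math

GIB = 1024 ** 3


def gpu_layers_that_fit(
    weights_bytes: int,
    n_layers: int,
    vram_bytes: int,
    reserved_bytes: int,
) -> int:
    """Estimate how many layers fit on the GPU, assuming equal-sized layers.

    reserved_bytes covers KV cache, compute buffers and headroom.
    """
    per_layer = weights_bytes / n_layers
    available = vram_bytes - reserved_bytes
    if available <= 0:
        return 0
    return min(n_layers, math.floor(available / per_layer))


# Hypothetical case: an 18 GiB model with 48 layers on a 12 GiB card,
# keeping 2 GiB back for cache, buffers and headroom.
on_gpu = gpu_layers_that_fit(18 * GIB, 48, 12 * GIB, 2 * GIB)
print(f"{on_gpu} of 48 layers on the GPU, {48 - on_gpu} offloaded to system RAM")
```

The more layers end up in system RAM, the larger the share of each token's computation that runs off the GPU, which is where the slowdown comes from.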

Offloading can be acceptable for occasional use, testing and batch jobs. In an interactive chat, the latency may be much more noticeable.

Do not assume that 16 GB of unified memory behaves exactly like 16 GB of dedicated VRAM. Memory architecture and bandwidth materially affect performance.

How to match a model to hardware

Measure free VRAM during normal use.
Keep at least 10–20% headroom.
Begin with a 4-bit model.
Set a realistic context length.
Measure tokens per second and time to first token.
Compare answer quality with a smaller model.
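The two speed metrics in the list above can be measured around any streaming interface that yields tokens one at a time. The sketch below uses a hypothetical `fake_stream` generator in place of a real runtime; in practice the iterable would be the token stream returned by whatever client library is in use.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Measure time to first token and decode speed for a token stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    # Decode speed counts only the tokens that arrived after the first one.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else float("nan")
    return {
        "tokens": count,
        "time_to_first_token_s": first - start,
        "tokens_per_second": tps,
    }


def fake_stream(n: int, first_delay: float, per_token_delay: float) -> Iterator[str]:
    """Stand-in for a model: waits, then yields n tokens at a fixed pace."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(per_token_delay)
        yield f"tok{i}"


result = measure_stream(fake_stream(n=5, first_delay=0.05, per_token_delay=0.01))
print(result)
```

Separating the two numbers matters because prompt processing and token generation stress the hardware differently; a long prompt can produce a slow first token even when generation afterwards is fast.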

Bigger does not always mean more useful. A responsive, well-matched 7B model may outperform a slow 14B model for classification, data extraction or a narrow domain.

Example scenarios

Laptop without a dedicated GPU. Start with a 2–4B model and a short context. Responsiveness, heat and battery use matter.

GPU with 8 GB VRAM. Quantized 7–8B models are generally the most comfortable starting point.

GPU with 12–16 GB VRAM. Models around 14B become practical, or smaller models can use longer contexts.

GPU with 24 GB VRAM. Models near 30B become possible, although context and other processes still need headroom.

Multi-user server. Loading the model is only the first constraint. Concurrent requests, batching, KV cache and response-time targets must also be budgeted for.
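For the multi-user case, a rough capacity estimate subtracts the fixed costs from usable VRAM and divides the remainder by the cache needed per session. The sketch assumes each session reserves a full-context cache of its own; `max_concurrent_sessions` is a hypothetical helper, the figures are placeholders, and runtimes that share or page the cache will behave differently.

```python
GIB = 1024 ** 3


def max_concurrent_sessions(
    vram_bytes: int,
    weights_bytes: int,
    kv_bytes_per_session: int,
    buffers_bytes: int,
    headroom: float = 0.15,
) -> int:
    """Estimate how many full-context sessions fit next to the weights."""
    usable = vram_bytes * (1.0 - headroom) - weights_bytes - buffers_bytes
    if usable <= 0:
        return 0
    return int(usable // kv_bytes_per_session)


# Hypothetical server: 24 GiB card, 9 GiB of weights, 1 GiB of buffers,
# and 1 GiB of KV cache per session at the target context length.
sessions = max_concurrent_sessions(24 * GIB, 9 * GIB, 1 * GIB, 1 * GIB)
print(f"about {sessions} concurrent sessions")
```

This only answers whether the sessions fit in memory; whether they meet the response-time target still has to be measured under load.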

When the cloud makes more sense

Buying a powerful GPU is not always economical. Cloud infrastructure can be the better choice when usage is irregular, when a very large model is needed only occasionally, or when the team does not want to maintain the environment.

Local hardware has an advantage for steady workloads, sensitive data, offline operation and predictable costs. Many teams use a hybrid design: routine tasks locally and the hardest requests in the cloud.
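One way to frame the decision is a break-even calculation: how many hours of use it takes before the purchase price is recovered by avoided cloud fees. Every number below is a placeholder rather than a real price, `break_even_hours` is a hypothetical helper, and the sketch ignores maintenance time, resale value and differences in performance.

```python
def break_even_hours(
    hardware_cost: float,
    cloud_rate_per_hour: float,
    local_power_cost_per_hour: float = 0.0,
) -> float:
    """Hours of use after which buying hardware beats renting."""
    saving_per_hour = cloud_rate_per_hour - local_power_cost_per_hour
    if saving_per_hour <= 0:
        return float("inf")  # renting never costs more per hour
    return hardware_cost / saving_per_hour


# Placeholder figures in an arbitrary currency, not real quotes.
hours = break_even_hours(
    hardware_cost=1600.0,
    cloud_rate_per_hour=0.80,
    local_power_cost_per_hour=0.08,
)
print(f"break-even after about {hours:.0f} hours of use")
print(f"that is roughly {hours / 8:.0f} working days at 8 hours per day")
```

Steady daily use reaches the break-even point quickly, while a few hours a month may never reach it within the useful life of the card.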

The most important advice

Do not buy a GPU based only on the parameter count in a model name. Download the exact quantization, set the target context and run tasks that represent the real workload.

Memory determines whether a model starts. Bandwidth, software quality and the nature of the task determine whether it is useful.
