Overview

I wanted a repeatable local setup for running GGUF models with llama.cpp on a laptop GPU. The test machine has an NVIDIA RTX 4060 Laptop GPU with 8 GB of VRAM, so the goal was not to force the whole model into GPU memory. The setup that worked best was a mix of GPU and CPU work, especially for MoE models.

These commands are meant as starting points. Replace the model paths with your own .gguf files.

Install the basics

On Arch Linux, install the build tools first:

sudo pacman -S git base-devel cmake

For NVIDIA GPU acceleration, check that the driver and CUDA toolkit are available:

nvidia-smi
nvcc --version

If either command is missing, install the needed packages:

sudo pacman -S nvidia-dkms nvidia-utils nvidia-settings cuda

Build llama.cpp with CUDA

[Image: CUDA build environment for llama.cpp on Arch Linux]

Clone the repo:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Generate the CUDA build files:

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_TESTS=OFF
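
Optionally, if CUDA compilation takes a long time or targets the wrong GPU architecture, restricting the target architectures can help. This is not required; 89 is the compute capability for Ada laptop GPUs like the RTX 4060, so adjust it for your card:

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_TESTS=OFF \
  -DCMAKE_CUDA_ARCHITECTURES=89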

Build it:

cmake --build build --config Release -j 16
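
If 16 does not match your core count, letting the shell pick it up works just as well:

cmake --build build --config Release -j "$(nproc)"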

Check that the binary works:

./build/bin/llama-cli --version
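
A quick smoke test with llama-cli confirms the GPU path end to end. The model path is a placeholder; if the CUDA build was picked up, the startup log should mention CUDA devices and offloaded layers:

./build/bin/llama-cli \
  --model /path/to/model.gguf \
  --n-gpu-layers 99 \
  -p "Say hello in one sentence." \
  -n 32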

Run a local server

A basic server command looks like this:

./build/bin/llama-server \
  --model /path/to/model.gguf \
  --ctx-size 65536 \
  --n-gpu-layers 99 \
  --threads 8 \
  --threads-batch 12 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --mlock \
  --host 0.0.0.0 \
  --port 8080
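
Once the server is up, it exposes an OpenAI-compatible API. A minimal check against the chat completions endpoint looks like this (llama-server serves whatever model it loaded, so no model field is needed in the request):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'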

These were the flags I ended up tuning the most:

--n-gpu-layers: pushes model layers to the GPU where possible.
--ctx-size: sets the context window; larger values need more memory.
--threads: CPU threads for generation.
--threads-batch: CPU threads for prompt processing.
--batch-size: affects prompt processing throughput.
--ubatch-size: smaller compute batches can help fit within memory limits.
--flash-attn on: enables flash attention when supported.
--cache-type-k / --cache-type-v: quantize the KV cache to reduce memory use.
--mlock: tries to keep model memory resident.
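
While tuning these, it helps to watch VRAM usage in a second terminal; if --n-gpu-layers or the context size is too ambitious, it shows up there before the server falls over:

watch -n 1 nvidia-smi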

MoE model tuning

[Image: MoE inference split between laptop GPU memory and CPU expert offload]

For MoE models, --n-cpu-moe is worth testing. It keeps the expert weights for a number of layers on the CPU instead of pushing all of them through limited VRAM, while the rest of the model stays on the GPU.

./build/bin/llama-server \
  --model /path/to/moe-model.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 31 \
  --ctx-size 65536 \
  --threads 8 \
  --threads-batch 12 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --mlock

On my RTX 4060 laptop GPU, this kind of setup was enough to get roughly 30 to 40 tokens per second depending on the model file, context size, and batch settings.

The tradeoff is straightforward: larger context and higher cache precision use more memory. Lower KV cache precision and smaller micro-batches make the setup easier to fit.

Benchmarking

Use llama-bench instead of guessing:

./build/bin/llama-bench \
  --model /path/to/model.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 31 \
  --flash-attn 1 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -p 1024 \
  -n 512 \
  -t 4,6,8,10,12,14 \
  -r 3

Then test batch and micro-batch sizes:

./build/bin/llama-bench \
  --model /path/to/model.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 31 \
  --flash-attn 1 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -p 1024 \
  -n 512 \
  -t 8 \
  -b 512,1024,2048,4096 \
  -ub 256,512,1024 \
  -r 3

In my testing, generation speed stayed in the same rough range across batch settings, while prompt processing changed more noticeably. On this machine, a good result after tuning batch size was 30+ tok/s generation with clearly faster prompt processing.
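
For MoE models, the --n-cpu-moe value itself is worth sweeping the same way. A small loop over llama-bench runs keeps it repeatable; the values below are illustrative, not tuned:

for n in 24 28 31 34; do
  echo "n-cpu-moe=$n"
  ./build/bin/llama-bench \
    --model /path/to/moe-model.gguf \
    --n-gpu-layers 99 \
    --n-cpu-moe "$n" \
    --flash-attn 1 \
    -p 1024 -n 512 -t 8 -r 3
done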

Practical config I would start from

For an 8 GB NVIDIA laptop GPU, this is the config I would try first:

./build/bin/llama-server \
  --model /path/to/model.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 31 \
  --ctx-size 65536 \
  --threads 8 \
  --threads-batch 12 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --mlock \
  --host 0.0.0.0 \
  --port 8080

If it crashes or runs out of memory, reduce these first:

--ctx-size
--batch-size
--ubatch-size
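
As a concrete example, a lower-memory variant of the same command might look like this; the specific values are illustrative starting points, not tuned results:

./build/bin/llama-server \
  --model /path/to/model.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 31 \
  --ctx-size 16384 \
  --threads 8 \
  --threads-batch 12 \
  --batch-size 512 \
  --ubatch-size 256 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --mlock \
  --host 0.0.0.0 \
  --port 8080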

If it runs but feels slow, benchmark different values for:

--threads
--threads-batch
--n-cpu-moe

Notes

Use q8_0 KV cache if you have more memory and want better cache quality. Use q4_0 when fitting the model is the main problem.

Large context sizes are useful, but they are not free. On an 8 GB GPU, I would start with a stable context size and increase it after the server is already running.

The next useful step is to save the working command as a small shell script per model profile. After that, benchmark only the parts that change: threads, batch size, micro-batch size, and MoE CPU offload.
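
A minimal sketch of such a profile script, assuming it sits next to the llama.cpp checkout and the model path is filled in per profile:

#!/usr/bin/env bash
# run-model.sh: one copy per model profile, so the tuned flags live in one place.
set -euo pipefail

LLAMA_BIN="./build/bin/llama-server"   # adjust to your llama.cpp build location
MODEL="/path/to/model.gguf"            # fill in per profile

exec "$LLAMA_BIN" \
  --model "$MODEL" \
  --n-gpu-layers 99 \
  --n-cpu-moe 31 \
  --ctx-size 65536 \
  --threads 8 \
  --threads-batch 12 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --mlock \
  --host 0.0.0.0 \
  --port 8080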