llama.cpp
TLDR
Run interactive chat
$ ./main -m [model.gguf] -i
Generate text from a prompt
$ ./main -m [model.gguf] -p "[Your prompt here]"
Set the context size
$ ./main -m [model.gguf] -c [4096] -p "[prompt]"
Use multiple threads
$ ./main -m [model.gguf] -t [8] -p "[prompt]"
Run server mode (see the query example after this list)
$ ./server -m [model.gguf] --port [8080]
Quantize a model
$ ./quantize [model.gguf] [output.gguf] [q4_0]
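Once the server is running, it can be queried over HTTP. A minimal sketch, assuming the server listens on port 8080 and exposes a /completion endpoint that accepts a JSON body (endpoint names and fields may differ between versions):
$ curl -X POST http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is quantization?", "n_predict": 128}'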
SYNOPSIS
main [options] -m model -p prompt
DESCRIPTION
llama.cpp is a port of Meta's LLaMA model to C/C++ for efficient CPU and GPU inference. It supports various quantization formats and runs LLMs on consumer hardware.
The project includes tools for model conversion, quantization, and serving.
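A typical end-to-end workflow is sketched below, assuming a Hugging Face checkpoint and a conversion script named convert_hf_to_gguf.py (the script name and flags have varied between releases):
$ python convert_hf_to_gguf.py [path/to/hf-model] --outfile [model-f16.gguf]
$ ./quantize [model-f16.gguf] [model-q4_0.gguf] q4_0
$ ./main -m [model-q4_0.gguf] -p "[prompt]"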
PARAMETERS
-m model
Path to GGUF model file.
-p prompt
Input prompt.
-i
Interactive mode.
-c size
Context size.
-t threads
Number of threads.
-n tokens
Number of tokens to generate.
--temp temp
Temperature for sampling.
-ngl layers
GPU layers to offload.
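The options combine freely. As an illustration with placeholder values: a 4096-token context, 8 threads, 256 generated tokens, a sampling temperature of 0.7, and 32 layers offloaded to the GPU:
$ ./main -m [model.gguf] -c 4096 -t 8 -n 256 --temp 0.7 -ngl 32 -p "[prompt]"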
SUPPORTED FORMATS
GGUF - Current format
Quantizations: q4_0, q4_1, q5_0, q5_1, q8_0
GPU: CUDA, Metal, OpenCL
CAVEATS
Models must be converted to GGUF format. Memory requirements depend on model size and quantization. GPU support varies by backend.
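As a rough estimate, assuming q4_0 costs about 4.5 bits per weight: a 7B-parameter model needs approximately 7e9 × 4.5 / 8 ≈ 3.9 GB for the weights alone, plus memory for the KV cache, which grows with context size.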
HISTORY
llama.cpp was created by Georgi Gerganov in March 2023 after Meta released LLaMA weights, enabling local LLM inference.