
llama.cpp

TLDR

Run interactive chat

$ ./main -m [model.gguf] -i
Generate text with prompt
$ ./main -m [model.gguf] -p "[Your prompt here]"
Set context size
$ ./main -m [model.gguf] -c [4096] -p "[prompt]"
Use multiple threads
$ ./main -m [model.gguf] -t [8] -p "[prompt]"
Run server mode (see the query example after this list)
$ ./server -m [model.gguf] --port [8080]
Quantize model
$ ./quantize [model.gguf] [output.gguf] [q4_0]
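The server started in the "Run server mode" entry exposes an HTTP JSON API. A minimal query sketch, assuming a reasonably recent build of the bundled server listening on the port chosen above:

$ curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Explain GGUF in one sentence.", "n_predict": 64}'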

SYNOPSIS

main [options] -m model -p prompt

DESCRIPTION

llama.cpp is a C/C++ implementation of LLM inference, originally created as a port of Meta's LLaMA model. It supports various quantization formats and runs models efficiently on consumer CPU and GPU hardware.
The project includes tools for model conversion, quantization, and serving.
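A typical end-to-end workflow ties these tools together. The sketch below assumes a Hugging Face model directory and the conversion script shipped with the repository (convert_hf_to_gguf.py in recent releases, convert.py in older ones); adjust names to your checkout:

$ python3 convert_hf_to_gguf.py ./my-hf-model --outfile model-f16.gguf --outtype f16
$ ./quantize model-f16.gguf model-q4_0.gguf q4_0
$ ./main -m model-q4_0.gguf -p "Hello" -n 64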

PARAMETERS

-m model
Path to the GGUF model file.
-p prompt
Initial prompt text.
-i
Run in interactive (chat) mode.
-c size
Size of the prompt context window, in tokens.
-t threads
Number of CPU threads to use.
-n tokens
Number of tokens to generate.
--temp temp
Sampling temperature; lower values give more deterministic output.
-ngl layers
Number of model layers to offload to the GPU.
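These flags combine freely. A representative invocation, with a placeholder model path and values:

$ ./main -m ./models/llama-7b-q4_0.gguf -p "Summarize GGUF in one paragraph." \
    -c 4096 -t 8 -n 256 --temp 0.7 -ngl 35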

SUPPORTED FORMATS

GGUF - current model format (older GGML files must be converted)
Quantizations: q4_0, q4_1, q5_0, q5_1, q8_0, among others
GPU backends: CUDA, Metal, OpenCL
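The last argument to the quantize tool selects one of the types listed above; the file names here are placeholders:

$ ./quantize model-f16.gguf model-q4_0.gguf q4_0
$ ./quantize model-f16.gguf model-q8_0.gguf q8_0

Of the listed types, q4_0 produces the smallest files, while q8_0 is the largest and stays closest to full-precision quality.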

CAVEATS

Models must first be converted to GGUF format. Memory requirements depend on model size and quantization; for example, a 7B-parameter model quantized to q4_0 needs roughly 4 GB for the weights, plus memory for the KV cache, which grows with the context size. GPU support varies by backend and build configuration. In newer releases the example binaries have been renamed (main to llama-cli, server to llama-server, quantize to llama-quantize).

HISTORY

llama.cpp was created by Georgi Gerganov in March 2023 after Meta released LLaMA weights, enabling local LLM inference.

SEE ALSO

llamafile(1), ollama(1), ggml(1)
