LinuxCommandLibrary

llamafile

Single-file executable for portable LLM inference

TLDR

Run a llamafile (launches chat in terminal and server on port 8080)
$ ./[model].llamafile
Run in server-only mode
$ ./[model].llamafile --server
Run in CLI mode with a prompt
$ ./[model].llamafile --cli -p "[prompt]"
Run interactive chat mode
$ ./[model].llamafile --chat
Load external model weights
$ llamafile -m [path/to/model.gguf]
Set context size and number of threads
$ ./[model].llamafile -c [8192] -t [8] -p "[prompt]"
Run server on a specific host and port
$ ./[model].llamafile --server --host [0.0.0.0] --port [8080]
Offload layers to GPU and set temperature
$ ./[model].llamafile -ngl [999] --temp [0.7] -p "[prompt]"

SYNOPSIS

llamafile [options]

DESCRIPTION

llamafile is a single-file executable that bundles llama.cpp with model weights for portable LLM inference. Built on Cosmopolitan Libc, the same file runs on Linux, macOS, Windows, FreeBSD, NetBSD, and OpenBSD without installation.

By default, llamafile launches both a terminal chatbot and an HTTP server with a web UI on port 8080. It can also run in dedicated CLI (--cli), chat (--chat), or server (--server) modes.
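In server mode, llamafile also exposes an OpenAI-compatible HTTP API alongside the web UI. A minimal sketch of a chat request against a server on the default port; the "model" value is a placeholder, since the server answers with whatever model it was started with:

```shell
# Build an OpenAI-style chat completion request body.
cat > request.json <<'EOF'
{
  "model": "local",
  "messages": [
    {"role": "user", "content": "Write a haiku about mountains."}
  ],
  "temperature": 0.7
}
EOF

# Sanity-check that the payload is valid JSON.
python3 -m json.tool request.json

# POST it to a running llamafile server (not run here; requires
# a llamafile started with --server on port 8080):
# curl -s http://localhost:8080/v1/chat/completions \
#      -H "Content-Type: application/json" -d @request.json
```

The response follows the OpenAI chat completion schema, so existing OpenAI client libraries can usually be pointed at the local server by changing the base URL.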

PARAMETERS

-m model
Path to model weights file (if not embedded in the llamafile).
-p prompt
Input prompt text.
--cli
Run in CLI mode, answering a single prompt.
--chat
Run interactive chat mode with slash commands.
--server
Start HTTP server mode with web UI.
-c size
Context window size in tokens.
-t threads
Number of threads to use for computation.
-n count
Maximum number of tokens to generate.
-ngl n
Number of layers to offload to GPU.
--host addr
Server listening address (default: 127.0.0.1).
--port port
Server port (default: 8080).
--temp value
Sampling temperature (higher = more random).
--top-k n
Top-k sampling (default: 40).
--top-p value
Top-p nucleus sampling (default: 0.95).
--seed n
Random seed for reproducible output.
--grammar grammar
Apply BNF grammar to constrain output format.
--mmproj file
Multimodal projection model weights for vision models.
--image file
Image file input for multimodal models.
--jinja
Enable Jinja template support for chat templates.
-e
Process escape sequences (\n, \t, etc.) in the prompt.

CAVEATS

File sizes can be large (several GB). Requires chmod +x on Unix systems. Apple Silicon may require code signing. Models are memory-mapped for efficiency.
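On Unix systems the downloaded file must be marked executable before it can run. A sketch, using an empty placeholder file to stand in for a real multi-gigabyte download:

```shell
# Placeholder standing in for a downloaded llamafile.
touch model.llamafile

# Mark it executable so the shell will run it.
chmod +x model.llamafile

# Verify the executable bit is set.
ls -l model.llamafile
```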

HISTORY

llamafile was created by Justine Tunney at Mozilla in 2023, combining Cosmopolitan Libc's universal binary format with llama.cpp.

SEE ALSO

llama.cpp(1), ollama(1)
