ollama
runs large language models locally
SYNOPSIS
ollama [command] [options]
DESCRIPTION
ollama runs large language models locally. It handles model downloads, serving via a REST API, and interactive chat sessions.

It supports a wide range of open models, including Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, and others. Models are pulled from the Ollama registry and cached locally.

The API server provides OpenAI-compatible endpoints for chat completions, embeddings, and model management. Custom models can be created using Modelfiles that specify base models, system prompts, parameters, and adapter layers.
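As a sketch, the OpenAI-compatible chat endpoint can be queried with curl. This assumes the server is running locally on the default port and that a model (here `llama3`, an illustrative choice) has already been pulled:

```shell
# Query the OpenAI-compatible chat completions endpoint of a local
# Ollama server. Requires `ollama serve` to be running and the
# llama3 model to be available (`ollama pull llama3`).
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```

The response follows the OpenAI chat-completions JSON shape, so existing OpenAI client libraries can be pointed at the local server by changing the base URL.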
COMMANDS
run MODEL [PROMPT]
    Run a model interactively, or with a one-off prompt.
pull MODEL
    Download a model from the registry.
push MODEL
    Push a model to the registry.
list (or ls)
    List locally available models.
show MODEL
    Show model information (architecture, parameters, license).
ps
    List currently running models.
stop MODEL
    Stop a running model.
rm MODEL
    Remove a local model.
cp SOURCE DESTINATION
    Copy a model locally under a new name.
serve
    Start the Ollama API server (default port 11434).
create NAME -f MODELFILE
    Create a custom model from a Modelfile.
--help
    Display help information.
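A minimal custom-model workflow, tying `create` and `run` together. The model names and system prompt below are illustrative; `FROM`, `SYSTEM`, and `PARAMETER` are standard Modelfile directives:

```shell
# Write a minimal Modelfile that layers a system prompt and a
# sampling parameter on top of a pulled base model.
cat > Modelfile <<'EOF'
FROM llama3
SYSTEM "You are a concise technical assistant."
PARAMETER temperature 0.2
EOF

# Build the custom model and run it with a one-off prompt
# (assumes the llama3 base model has already been pulled).
ollama create mymodel -f Modelfile
ollama run mymodel "Summarize what a Modelfile does."
```

The custom model then appears in `ollama list` alongside pulled models and can be removed with `ollama rm mymodel`.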
CAVEATS
Requires sufficient RAM/VRAM, depending on model size. GPU acceleration is supported (NVIDIA, AMD, Apple Silicon). The API server listens on localhost:11434 by default; configure this with the OLLAMA_HOST environment variable.
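For example, to expose the server beyond localhost, OLLAMA_HOST can be set when starting it (the address and port below are illustrative):

```shell
# Bind the API server to all interfaces on a non-default port.
OLLAMA_HOST=0.0.0.0:8080 ollama serve
```

The same variable is honored by the client-side subcommands (run, pull, list, etc.), so they can target a remote server by setting OLLAMA_HOST to its address.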
HISTORY
Ollama was created by Jeffrey Morgan and first released in 2023. Built on llama.cpp, it simplifies the process of downloading, running, and managing open-source language models locally. The project quickly gained popularity as interest in running LLMs without cloud APIs grew.
