whichllm

Rank local LLMs that actually run well on your hardware

TLDR

Detect hardware and list the best-fit local models

$ whichllm

Show only your detected hardware

$ whichllm hardware

Restrict ranking to CPU-only machines

$ whichllm --cpu-only

Simulate a specific GPU for purchase planning

$ whichllm --gpu "[RTX 4090]"

Plan in reverse: what GPU runs a given model

$ whichllm plan [model_name]

Download a model and chat with it interactively

$ whichllm run [model_name]

Print a Python snippet for using a model

$ whichllm snippet [model_name]

Emit JSON for scripting

$ whichllm --json

whichllm detects local hardware (GPU model and VRAM, CPU, RAM, OS) and ranks open-weight large language models from HuggingFace and Ollama by how well they will actually run on that machine. Instead of treating "fits in VRAM" as the only criterion, it combines a fit check with recency-aware benchmark scores from sources such as LiveBench, Artificial Analysis, Aider, and the Chatbot Arena Elo leaderboard, and applies penalties for quantization, partial offload, and MoE architectures.

The tool is designed for the common practical question "which model should I download tonight" rather than for marketing claims. The default invocation prints a short ranked table; subcommands extend the same engine to launch interactive sessions, plan hardware upgrades, or emit code snippets for direct integration.
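The ranking idea described above (a fit check combined with a benchmark score and multiplicative penalties) can be illustrated with a minimal sketch. This is not whichllm's actual formula: the Candidate class, the penalty factors, the overhead allowance, and the bits-per-weight table are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Approximate bits per weight for common quantizations (assumed, rounded).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "fp16": 16.0}

# Hypothetical quality multipliers: heavier quantization costs some quality.
QUANT_PENALTY = {"Q4_K_M": 0.95, "Q5_K_M": 0.97, "Q8_0": 0.99, "fp16": 1.0}

PARTIAL_OFFLOAD_PENALTY = 0.5  # hypothetical: layers spilling to system RAM
MOE_PENALTY = 0.9              # hypothetical: mixture-of-experts adjustment


@dataclass
class Candidate:
    name: str
    params_b: float    # total parameters, in billions
    quant: str         # key into BITS_PER_WEIGHT
    benchmark: float   # 0-100, assumed to be recency-weighted already
    is_moe: bool = False


def weights_gb(c: Candidate) -> float:
    """Estimate weight size in GB: billions of params * bits / 8 bits per byte."""
    return c.params_b * BITS_PER_WEIGHT[c.quant] / 8


def score(c: Candidate, vram_gb: float, ram_gb: float,
          overhead_gb: float = 1.5) -> float:
    """Return 0.0 if the model cannot be loaded, else a penalised benchmark."""
    need = weights_gb(c) + overhead_gb  # overhead: rough KV cache allowance
    if need <= vram_gb:
        fit = 1.0                       # fits entirely in VRAM
    elif need <= vram_gb + ram_gb:
        fit = PARTIAL_OFFLOAD_PENALTY   # runs, but partly from system RAM
    else:
        return 0.0                      # does not fit at all
    moe = MOE_PENALTY if c.is_moe else 1.0
    return c.benchmark * QUANT_PENALTY[c.quant] * fit * moe
```

With these made-up factors, a 70B model at Q4_K_M on a 24 GB GPU with 64 GB of RAM still receives a score, but half of what it would get if it fit in VRAM, so a smaller model that fits fully can outrank it.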

PARAMETERS

hardware

Print detected GPU, CPU, RAM, and OS information without ranking models.

run model

Download model via Ollama and start an interactive chat session.

plan model

Reverse lookup: estimate which GPU or RAM tier is needed to run model at usable speed.

snippet model

Print a ready-to-paste Python snippet that loads model from HuggingFace or Ollama.

--gpu name

Override hardware detection and rank as if running on the named GPU (e.g. "RTX 4090").

--cpu-only

Restrict ranking to models that run acceptably without a GPU.

--top N

Show the top N ranked models instead of the default short list.

--quant type

Filter results by quantization (e.g. Q4_K_M, Q5_K_M, Q8_0, fp16).

--profile use_case

Bias ranking towards a specific profile (coding, vision, math, general).

--json

Emit machine-readable JSON instead of the formatted table.

--refresh

Bypass the local cache and refetch benchmark data.

--version

Print version and exit.

--help

Print help and exit.
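For scripting, the options above can be combined with --json and driven from Python. The following is a minimal sketch: build_argv and ranked_models are hypothetical helper names, only flags listed on this page are used, and the structure of the JSON output is not documented here, so the parsed value is returned as-is rather than assuming any field names.

```python
import json
import shutil
import subprocess


def build_argv(gpu=None, cpu_only=False, top=None, quant=None, profile=None):
    """Assemble a whichllm command line from the flags documented above."""
    argv = ["whichllm", "--json"]
    if gpu:
        argv += ["--gpu", gpu]           # e.g. "RTX 4090"
    if cpu_only:
        argv.append("--cpu-only")        # whether this combines with --gpu is not specified here
    if top is not None:
        argv += ["--top", str(top)]
    if quant:
        argv += ["--quant", quant]       # e.g. "Q4_K_M"
    if profile:
        argv += ["--profile", profile]   # coding, vision, math, general
    return argv


def ranked_models(**options):
    """Run whichllm and return its parsed JSON output (schema not assumed)."""
    if shutil.which("whichllm") is None:
        raise FileNotFoundError("whichllm is not on PATH")
    result = subprocess.run(build_argv(**options),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with GPU names that contain spaces.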

CONFIGURATION

~/.cache/whichllm/

Cached benchmark snapshots; cleared by --refresh.

Ollama

When present, whichllm run delegates model download and serving to a local Ollama daemon.

CAVEATS

Rankings depend on third-party benchmarks; new models appear before their scores stabilise, so use --refresh if a recent release is missing. Hardware detection works best on NVIDIA, AMD, and Apple Silicon; exotic accelerators may fall back to CPU-only estimates. The tool only recommends models; it does not enforce licensing constraints on the suggested weights.

HISTORY

whichllm was published in 2025 by Andyyyy64 as a Python utility distributed via uv, pip, and Homebrew. It emerged as the local-LLM ecosystem fragmented across HuggingFace, Ollama, and dozens of quantization formats, where simply checking VRAM size was no longer enough to pick a usable model. The project has continued to track new releases and benchmark updates through v0.5.x (2026).