
auto-round

Low-bit quantization toolkit for LLMs and VLMs

TLDR

Quantize a model with the default recipe
$ auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]"
Use the best recipe (slower, higher accuracy)
$ auto-round-best --model [model_id] --scheme "[W4A16]"
Use the light recipe (faster)
$ auto-round-light --model [model_id] --scheme "[W4A16]"
Quantize to 4-bit with multiple export formats
$ auto-round --model [model_id] --bits 4 --group_size 128 --format "[auto_round,auto_awq,auto_gptq]" --output_dir [path/to/output]
Calibration-free RTN mode
$ auto-round --model [model_id] --bits 4 --iters 0
Multi-GPU quantization
$ auto-round --model [model_id] --device_map "[0,1,2,3]"
Evaluate an already-quantized model
$ auto-round --model [path/to/quantized] --eval --tasks [mmlu,lambada_openai]

SYNOPSIS

auto-round --model MODEL [options]
auto-round-best --model MODEL [options]
auto-round-light --model MODEL [options]

DESCRIPTION

auto-round is a weight-only post-training quantization (PTQ) toolkit for LLMs and VLMs, developed by Intel. It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, achieving high accuracy at ultra-low bit widths (down to 2 bits) with minimal calibration time.

The toolkit supports CPU, Intel GPU (XPU), HPU, and CUDA backends and exports to several popular quantization formats, including auto_round, auto_awq, auto_gptq, and gguf, so models can be served via Transformers, vLLM, SGLang, or llm-compressor without re-quantization.

Three recipes are provided: auto-round (default balance), auto-round-best (highest accuracy, roughly 4–5× slower), and auto-round-light (fastest, roughly 2–3× speedup).
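
A minimal end-to-end sketch of the workflow described above, assuming vLLM is installed (the vllm serve step belongs to vLLM's own CLI, not to auto-round, and is shown only for illustration): quantize with the default recipe, then serve the exported directory directly.
$ auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]" --output_dir [path/to/output]
$ vllm serve [path/to/output]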

PARAMETERS

--model MODEL

Model identifier or local path (e.g. Qwen/Qwen3-0.6B).
--scheme SCHEME
Quantization scheme such as W4A16, W2A16, W8A16.
--bits N
Weight bit width: 2, 3, 4, or 8.
--group_size N
Quantization group size (e.g. 32, 64, 128).
--format FORMAT
Export format(s), comma-separated: auto_round, auto_gptq, auto_awq, gguf:q4_k_m, etc.
--output_dir PATH
Directory where the quantized model is written.
--dataset SPEC
Calibration data (local path or HuggingFace dataset). Supports name:num=N, :concat=True, :apply_chat_template, and comma-separated lists; see the combined example after this parameter list.
--iters N
Tuning iterations (0 for RTN, default 200, up to 1000 for best accuracy).
--bs N
Batch size (default 8).
--seqlen N
Calibration sequence length (default 2048).
--nsamples N
Number of calibration samples (default 128, up to 512 for best).
--lr RATE
Learning rate.
--device_map SPEC
GPU assignment, e.g. auto or 0,1,2,3.
--low_gpu_mem_usage
Reduce VRAM at the cost of more time.
--enable_torch_compile
Use torch.compile (requires PyTorch 2.6+).
--quant_lm_head
Also quantize the language-model head (auto_round format only).
--adam
Use the AdamW optimizer instead of signed gradient descent.
--eval
Evaluate the model after quantization.
--eval_backend BACKEND
Evaluation engine: vllm, or the default Hugging Face backend.
--tasks LIST
Comma-separated lm-eval-harness tasks (e.g. mmlu,lambada_openai).
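
A combined sketch of the calibration-related options above (dataset_name is a placeholder for a local path or HuggingFace dataset id; the :num=N modifier follows the --dataset syntax listed above, and the other values are illustrative, not prescriptive):
$ auto-round --model [model_id] --dataset "[dataset_name:num=256]" --seqlen 1024 --iters 200 --bs 4 --output_dir [path/to/output]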

DESCRIPTION OF FORMATS

auto_round

Native AutoRound format, supports lm-head quantization.
auto_gptq
GPTQ-compatible format.
auto_awq
AWQ-compatible format.
gguf:q4_k_m, gguf:q2_k_s
GGUF formats for llama.cpp / Ollama-style runtimes.
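
For example, a GGUF export for llama.cpp / Ollama-style runtimes uses the same CLI with a gguf format tag (loading the resulting .gguf file is done with those runtimes' own tools, outside auto-round):
$ auto-round --model [model_id] --format "[gguf:q4_k_m]" --output_dir [path/to/output]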

CAVEATS

Calibration is sensitive to dataset quality and length; using domain-mismatched calibration data can degrade accuracy. Lower bit widths (2-3 bits) may need the best recipe to recover accuracy. Some export formats restrict feature combinations (e.g. --quant_lm_head only works with the auto_round format).
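
For example, the low-bit and lm-head caveats above translate into invocations like the following (illustrative combinations built only from the documented options):
$ auto-round-best --model [model_id] --scheme "[W2A16]" --nsamples 512
$ auto-round --model [model_id] --quant_lm_head --format "[auto_round]"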

HISTORY

AutoRound was introduced by Intel as part of its LLM quantization stack. It distinguishes itself from older PTQ methods such as GPTQ and AWQ by jointly optimizing rounding and clipping with signed gradient descent, narrowing the accuracy gap to QAT at low bit widths while remaining a calibration-only method.

SEE ALSO

python(1), vllm(1), llama.cpp(1)
