auto-round
Low-bit quantization toolkit for LLMs and VLMs
SYNOPSIS
auto-round --model MODEL [options]
auto-round-best --model MODEL [options]
auto-round-light --model MODEL [options]
DESCRIPTION
auto-round is a weight-only post-training quantization (PTQ) toolkit for LLMs and VLMs, developed by Intel. It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, achieving high accuracy at ultra-low bit widths (down to 2 bits) with minimal calibration time.

The toolkit supports CPU, Intel GPU (XPU), HPU, and CUDA back-ends and exports to several popular quantization formats, including auto_round, auto_awq, auto_gptq, and gguf, so models can be served via Transformers, vLLM, SGLang, or llm-compressor without re-quantization.

Three recipes are provided: auto-round (default, balanced), auto-round-best (highest accuracy, 4-5x slower), and auto-round-light (fastest, 2-3x speedup).
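The three recipes share the same command-line interface; a minimal sketch, where the model name, scheme, and output paths are illustrative placeholders:

```shell
# Default recipe: balanced accuracy/speed
auto-round --model Qwen/Qwen3-0.6B --scheme W4A16 --output_dir ./qmodel

# Best recipe: longer tuning, highest accuracy (roughly 4-5x slower)
auto-round-best --model Qwen/Qwen3-0.6B --scheme W4A16 --output_dir ./qmodel-best

# Light recipe: fewer iterations, fastest tuning (roughly 2-3x speedup)
auto-round-light --model Qwen/Qwen3-0.6B --scheme W4A16 --output_dir ./qmodel-light
```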
PARAMETERS
--model MODEL
Model identifier or local path (e.g. Qwen/Qwen3-0.6B).
--scheme SCHEME
Quantization scheme such as W4A16, W2A16, W8A16.
--bits N
Weight bit width: 2, 3, 4, or 8.
--group_size N
Quantization group size (e.g. 32, 64, 128).
--format FORMAT
Export format(s), comma-separated: auto_round, auto_gptq, auto_awq, gguf:q4_k_m, etc.
--output_dir PATH
Directory where the quantized model is written.
--dataset SPEC
Calibration data (local path or Hugging Face dataset). Supports name:num=N, :concat=True, :apply_chat_template, and comma-separated lists.
--iters N
Tuning iterations (0 for RTN; default 200; up to 1000 for best accuracy).
--bs N
Batch size (default 8).
--seqlen N
Calibration sequence length (default 2048).
--nsamples N
Number of calibration samples (default 128; up to 512 for best accuracy).
--lr RATE
Learning rate.
--device_map SPEC
GPU assignment, e.g. auto or 0,1,2,3.
--low_gpu_mem_usage
Reduce VRAM usage at the cost of longer tuning time.
--enable_torch_compile
Use torch.compile (requires PyTorch 2.6+).
--quant_lm_head
Also quantize the language-model head (auto_round format only).
--adam
Use the AdamW optimizer instead of signed gradient descent.
--eval
Evaluate the model after quantization.
--eval_backend BACKEND
Evaluation engine: vllm, or the default Hugging Face backend.
--tasks LIST
Comma-separated lm-eval-harness tasks (e.g. mmlu,lambada_openai).
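Putting several of these options together, a hedged sketch of a longer tuning run followed by evaluation; the model, dataset spec, and task names are illustrative:

```shell
# Longer tuning run: custom calibration data, more iterations and samples,
# then lm-eval-harness evaluation on the quantized model
auto-round \
  --model Qwen/Qwen3-0.6B \
  --bits 4 --group_size 128 \
  --dataset "NeelNanda/pile-10k:num=256" \
  --iters 1000 --nsamples 512 \
  --format auto_round \
  --output_dir ./qmodel \
  --eval --tasks mmlu,lambada_openai
```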
DESCRIPTION OF FORMATS
auto_round
Native AutoRound format; supports lm-head quantization.
auto_gptq
GPTQ-compatible format.
auto_awq
AWQ-compatible format.
gguf:q4_k_m, gguf:q2_k_s
GGUF formats for llama.cpp / Ollama-style runtimes.
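A sketch of exporting to one or more formats in a single run; the --format flag accepts a comma-separated list, and the paths shown are placeholders:

```shell
# Export the same tuned weights to two formats without re-quantizing
auto-round --model Qwen/Qwen3-0.6B --scheme W4A16 \
  --format auto_round,auto_awq --output_dir ./qmodel

# GGUF export for llama.cpp / Ollama-style runtimes
auto-round --model Qwen/Qwen3-0.6B \
  --format gguf:q4_k_m --output_dir ./qmodel-gguf
```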
CAVEATS
Calibration is sensitive to dataset quality and length; using domain-mismatched calibration data can degrade accuracy. Lower bit widths (2-3 bits) may need the best recipe to recover accuracy. Some export formats restrict feature combinations (e.g. --quant_lm_head only works with the auto_round format).
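For instance, a 2-bit run would typically use the best recipe to recover accuracy; the model and output path below are placeholders:

```shell
# Ultra-low-bit (W2A16) quantization with the slow, high-accuracy recipe
auto-round-best --model Qwen/Qwen3-0.6B --scheme W2A16 --output_dir ./qmodel-w2
```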
HISTORY
AutoRound was introduced by Intel as part of its LLM quantization stack. It distinguishes itself from older PTQ methods such as GPTQ and AWQ by jointly optimizing rounding and clipping with signed gradient descent, narrowing the accuracy gap to QAT at low bit widths while remaining a calibration-only method.
