
auto-round

Low-bit quantization toolkit for LLMs and VLMs

TLDR

Quantize a model with the default recipe
$ auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]"
Use the best recipe (slower, higher accuracy)
$ auto-round-best --model [model_id] --scheme "[W4A16]"
Use the light recipe (faster)
$ auto-round-light --model [model_id] --scheme "[W4A16]"
Quantize to 4-bit with multiple export formats
$ auto-round --model [model_id] --bits 4 --group_size 128 --format "[auto_round,auto_awq,auto_gptq]" --output_dir [path/to/output]
Calibration-free RTN mode
$ auto-round --model [model_id] --bits 4 --iters 0
Multi-GPU quantization
$ auto-round --model [model_id] --device_map "[0,1,2,3]"
Evaluate an already-quantized model
$ auto-round --model [path/to/quantized] --eval --tasks [mmlu,lambada_openai]

SYNOPSIS

auto-round --model MODEL [options]
auto-round-best --model MODEL [options]
auto-round-light --model MODEL [options]

DESCRIPTION

auto-round is a weight-only post-training quantization (PTQ) toolkit for LLMs and VLMs, developed by Intel. It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, achieving high accuracy at ultra-low bit widths (down to 2 bits) with minimal calibration time.

The toolkit supports CPU, Intel GPU (XPU), HPU, and CUDA backends and exports to several popular quantization formats, including auto_round, auto_awq, auto_gptq, and gguf, so models can be served via Transformers, vLLM, SGLang, or llm-compressor without re-quantization.

Three recipes are provided: auto-round (default balance), auto-round-best (highest accuracy, roughly 4–5× slower), and auto-round-light (fastest, roughly 2–3× speedup).
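
A minimal end-to-end sketch of the workflow described above, assuming vLLM is installed (the vllm serve step belongs to vLLM's own CLI, not to auto-round, and is shown only for illustration): quantize with the default recipe, then serve the exported directory directly.
$ auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]" --output_dir [path/to/output]
$ vllm serve [path/to/output]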

PARAMETERS

--model MODEL

Model identifier or local path (e.g. Qwen/Qwen3-0.6B).
--scheme SCHEME
Quantization scheme such as W4A16, W2A16, W8A16.
--bits N
Weight bit width: 2, 3, 4, or 8.
--group_size N
Quantization group size (e.g. 32, 64, 128).
--format FORMAT
Export format(s), comma-separated: auto_round, auto_gptq, auto_awq, gguf:q4_k_m, etc.
--output_dir PATH
Directory where the quantized model is written.
--dataset SPEC
Calibration data (local path or HuggingFace dataset). Supports name:num=N, :concat=True, :apply_chat_template, and comma-separated lists; see the combined example after this parameter list.
--iters N
Tuning iterations (0 for RTN, default 200, up to 1000 for best accuracy).
--bs N
Batch size (default 8).
--seqlen N
Calibration sequence length (default 2048).
--nsamples N
Number of calibration samples (default 128, up to 512 for best).
--lr RATE
Learning rate.
--device_map SPEC
GPU assignment, e.g. auto or 0,1,2,3.
--low_gpu_mem_usage
Reduce VRAM at the cost of more time.
--enable_torch_compile
Use torch.compile (requires PyTorch 2.6+).
--quant_lm_head
Also quantize the language-model head (auto_round format only).
--adam
Use the AdamW optimizer instead of signed gradient descent.
--eval
Evaluate the model after quantization.
--eval_backend BACKEND
Evaluation engine: vllm, or the default Hugging Face backend.
--tasks LIST
Comma-separated lm-eval-harness tasks (e.g. mmlu,lambada_openai).
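
A combined sketch of the calibration-related options above (dataset_name is a placeholder for a local path or HuggingFace dataset id; the :num=N modifier follows the --dataset syntax listed above, and the other values are illustrative, not prescriptive):
$ auto-round --model [model_id] --dataset "[dataset_name:num=256]" --seqlen 1024 --iters 200 --bs 4 --output_dir [path/to/output]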

DESCRIPTION OF FORMATS

auto_round

Native AutoRound format, supports lm-head quantization.
auto_gptq
GPTQ-compatible format.
auto_awq
AWQ-compatible format.
gguf:q4_k_m, gguf:q2_k_s
GGUF formats for llama.cpp / Ollama-style runtimes.
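
For example, a GGUF export for llama.cpp / Ollama-style runtimes uses the same CLI with a gguf format tag (loading the resulting .gguf file is done with those runtimes' own tools, outside auto-round):
$ auto-round --model [model_id] --format "[gguf:q4_k_m]" --output_dir [path/to/output]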

CAVEATS

Calibration is sensitive to dataset quality and length; using domain-mismatched calibration data can degrade accuracy. Lower bit widths (2-3 bits) may need the best recipe to recover accuracy. Some export formats restrict feature combinations (e.g. --quant_lm_head only works with the auto_round format).
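
For example, the low-bit and lm-head caveats above translate into invocations like the following (illustrative combinations built only from the documented options):
$ auto-round-best --model [model_id] --scheme "[W2A16]" --nsamples 512
$ auto-round --model [model_id] --quant_lm_head --format "[auto_round]"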

HISTORY

AutoRound was introduced by Intel as part of its LLM quantization stack. It distinguishes itself from older PTQ methods such as GPTQ and AWQ by jointly optimizing rounding and clipping with signed gradient descent, narrowing the accuracy gap to QAT at low bit widths while remaining a calibration-only method.

SEE ALSO

python(1), vllm(1), llama.cpp(1)
