LinuxCommandLibrary

clusterdb

Recluster PostgreSQL database tables

TLDR

Cluster a specific database

$ clusterdb [database_name]
copy

Cluster all databases
$ clusterdb [[-a|--all]]
copy

Cluster a specific table in a database
$ clusterdb [[-t|--table]] [table_name] [database_name]
copy

SYNOPSIS

clusterdb [connection options] [options] [dbname]
clusterdb [connection options] [options] --all

PARAMETERS

-a, --all
    Cluster all databases accessible by the specified user, except template databases. Only tables that have a cluster index defined will be processed.

-d dbname, --dbname=dbname
    Specify the name of the database to cluster. If not specified and --all is not used, the database name is read from the PGDATABASE environment variable, or defaults to the user name.

-e, --echo
    Echo the SQL commands that clusterdb generates and sends to the server. This is useful for debugging or understanding the underlying operations.

-q, --quiet
    Do not display progress messages. Only error messages will be shown.

-t table, --table=table
    Cluster only the specified table. Multiple -t switches can be given to cluster multiple tables. If used with --all, -t specifies tables to cluster in each database.

-v, --verbose
    Print verbose output, including detailed progress messages and the names of tables being clustered.

-z, --dry-run
    Show the commands that would be executed, but do not actually run them. This is useful for testing and planning without making changes.

--maintenance-db=dbname
    When --all is specified, this option specifies the database to connect to in order to discover other databases. If not specified, defaults to postgres or the user's name.

--cluster-options=options
    Pass options directly to the underlying CLUSTER SQL command. For example, --cluster-options=FREEZE to freeze all rows in the newly written table.

-h host, --host=host
    Database server host name or socket directory. If this value begins with a slash, it is used as the directory for the Unix domain socket.

-p port, --port=port
    Database server port number for the connection.

-U username, --username=username
    Connect as the specified database user. If not specified, defaults to the operating system user name.

-w, --no-password
    Never prompt for password. Useful for non-interactive scripts where a password is provided via environment variables or a password file.

-W, --password
    Force password prompt. Useful when password authentication is expected but a password is not provided in other ways.

--help
    Show help message and exit, providing a summary of command-line arguments.

--version
    Show version information and exit.

DESCRIPTION

The clusterdb command is a utility provided with PostgreSQL that re-clusters tables within a database. Clustering a table means physically reordering its data on disk according to an index, typically a primary key or a frequently used index. This process can significantly improve query performance, especially for queries involving range scans, sequential scans, or joins where data locality is crucial.

When data is frequently updated, deleted, or inserted, the physical order of rows can diverge from the logical order defined by an index. This 'data fragmentation' can lead to increased disk I/O as the database has to jump between different physical locations to retrieve logically contiguous data. clusterdb helps rectify this by rebuilding the table in the desired index order.

Essentially, clusterdb is a convenient wrapper around the SQL command CLUSTER. It allows administrators to cluster all previously clustered tables in a database, all tables in all databases, or a specific table, without needing to connect to psql and execute individual CLUSTER commands. It connects to the PostgreSQL server and issues the necessary CLUSTER commands, simplifying maintenance tasks.

CAVEATS

Clustering a table requires an ACCESS EXCLUSIVE lock on that table, which means it will block all other read and write operations on the table until the clustering is complete. This can lead to significant downtime for large tables and active systems. Therefore, clusterdb should typically be run during maintenance windows or periods of low activity. The operation also requires temporary disk space roughly equal to the size of the table being clustered, as it creates a new copy of the table. Only tables that have previously been clustered using the SQL CLUSTER command or an index created WITH (CLUSTER) will be processed by default; otherwise, it will report that the table has no cluster index.

WHEN TO USE CLUSTER

Clustering is most beneficial for tables that experience frequent sequential scans, range scans, or join operations on the clustered index. If the physical order of data has significantly deviated from the index order due to heavy INSERT/UPDATE/DELETE activity, re-clustering can dramatically reduce disk I/O and improve cache hit rates. It's often considered after a significant amount of data modification, or if query plans indicate high costs related to table scans that would benefit from sorted data. Regular monitoring of table statistics and query performance helps determine if and when clustering is necessary.

PERFORMANCE IMPLICATIONS

A well-clustered table can lead to faster data retrieval because related data rows are stored close together on disk. This reduces the number of disk pages that need to be read, improves CPU cache utilization, and minimizes random I/O, leading to better overall query performance for affected queries. However, the clustering process itself is resource-intensive, requiring considerable CPU and I/O resources during its execution. The trade-off between the maintenance overhead and the potential query performance gain should be carefully evaluated for each specific workload and table.

HISTORY

clusterdb has been a foundational part of the PostgreSQL client utility suite for many years, evolving alongside the database system itself. Its primary function as a convenient command-line wrapper for the SQL CLUSTER command has remained consistent. It provides database administrators with an essential tool for managing data organization and optimizing query performance, particularly in scenarios where data skew or fragmentation significantly impacts I/O efficiency. Over time, options like --dry-run and --maintenance-db have been added to enhance its usability, provide greater control, and integrate better into automated maintenance scripts.

SEE ALSO

vacuumdb(1), reindexdb(1), pg_dump(1), psql(1)

Copied to clipboard