LinuxCommandLibrary

datashader_cli

Render large datasets into images

TLDR

Create a shaded scatter plot of points, save it to a PNG file, and set the background color

$ datashader_cli points [path/to/input.parquet] --x [pickup_x] --y [pickup_y] [path/to/output.png] --background [black|white|#rrggbb]

Visualize geospatial data (supports GeoParquet, Shapefile, GeoJSON, GeoPackage, etc.)
$ datashader_cli points [path/to/input_data.geo.parquet] [path/to/output_data.png] --geo true

Use matplotlib to render the image
$ datashader_cli points [path/to/input_data.geo.parquet] [path/to/output_data.png] --geo true --matplotlib true

SYNOPSIS

datashader_cli <command> <input_path> <output_path> [options]

Commands:
  points: Rasterize point data.
  raster: Rasterize pre-aggregated raster data.
  polygons: Rasterize polygon data.
  lines: Rasterize line data.
  mesh: Rasterize mesh data.

PARAMETERS

--width <int>
    Output image width in pixels. (Default: 800)

--height <int>
    Output image height in pixels. (Default: 600)

--x <str>
    Name of the column containing x-coordinates (for example, longitude).

--y <str>
    Name of the column containing y-coordinates (for example, latitude).

--xmin <float>
    Minimum x-coordinate for the spatial extent.

--ymin <float>
    Minimum y-coordinate for the spatial extent.

--xmax <float>
    Maximum x-coordinate for the spatial extent.

--ymax <float>
    Maximum y-coordinate for the spatial extent.

--agg <str>
    Aggregation method (e.g., 'count', 'mean', 'sum', 'min', 'max').

--column <str>
    Column to aggregate for 'mean', 'sum', 'min', 'max' aggregations.

--cmap <str>
    Colormap to apply (e.g., 'fire', 'viridis', 'blues').

--format <str>
    Output format ('png' for image, 'netcdf' for aggregated data). (Default: 'png')

--by <str>
    Column to group data by before aggregation.

--logx
    Apply logarithmic scaling to the x-axis.

--logy
    Apply logarithmic scaling to the y-axis.

--pre-agg
    Indicates that input data is already pre-aggregated (for 'raster' command).

--resample-width <int>
    Resample width for finer control over rasterization.

--resample-height <int>
    Resample height for finer control over rasterization.

--line-width <float>
    Width of lines in pixels (for 'lines' command). (Default: 1.0)

--buffer <float>
    Buffer distance around geometries (for 'polygons', 'lines' commands).

--antialias
    Apply antialiasing to the output image for smoother edges.
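
As a rough illustration of how an extent (--xmin, --xmax), an output size (--width), and log scaling (--logx) interact, the sketch below maps one coordinate to a pixel column. It shows the general idea only and is not taken from datashader_cli; the function name to_pixel and its arguments are hypothetical.

```python
import math


def to_pixel(value, vmin, vmax, size, log=False):
    """Map a data coordinate to a pixel index in [0, size - 1].

    Returns None when the value falls outside the extent.
    Hypothetical helper for illustration; not part of datashader_cli.
    """
    if log:
        # Log scaling only makes sense for strictly positive values and bounds.
        if value <= 0 or vmin <= 0:
            return None
        value, vmin, vmax = math.log10(value), math.log10(vmin), math.log10(vmax)
    if not (vmin <= value <= vmax):
        return None
    # Scale into [0, size) and clamp the upper edge into the last pixel.
    index = int((value - vmin) / (vmax - vmin) * size)
    return min(index, size - 1)


# Linear axis: extent 0..100 spread across 800 pixels.
print(to_pixel(50.0, 0.0, 100.0, 800))               # 400
# Log axis: extent 1..10000 spread across 800 pixels.
print(to_pixel(50.0, 1.0, 10000.0, 800, log=True))   # 339
# Outside the extent: the point is dropped.
print(to_pixel(150.0, 0.0, 100.0, 800))              # None
```

On a log axis the value 50 lands well past the left edge because each decade of the extent gets an equal share of the pixels.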

DESCRIPTION

The datashader_cli command-line interface applies Datashader's rasterization to large datasets without requiring any Python code. It is designed to visualize millions or billions of data points, lines, or polygons by aggregating them into a fixed-size raster image. Because the output size is fixed, it can render datasets that would otherwise overwhelm traditional plotting tools.
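
The aggregation step described above can be sketched in plain Python. This is a minimal illustration of count aggregation into a fixed grid, assuming a simple linear mapping from coordinates to cells; it is not Datashader's implementation, and the name rasterize_counts is hypothetical.

```python
def rasterize_counts(points, x_range, y_range, width, height):
    """Count how many points fall into each cell of a width x height grid.

    Hypothetical helper for illustration. Row 0 holds the lowest y values.
    """
    xmin, xmax = x_range
    ymin, ymax = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Points outside the requested extent are ignored.
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue
        # Scale to a cell index and clamp the upper edge into the last cell.
        col = min(int((x - xmin) / (xmax - xmin) * width), width - 1)
        row = min(int((y - ymin) / (ymax - ymin) * height), height - 1)
        grid[row][col] += 1
    return grid


points = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.9), (2.0, 2.0)]
grid = rasterize_counts(points, (0.0, 1.0), (0.0, 1.0), width=4, height=2)
print(grid)  # [[2, 0, 0, 0], [0, 0, 0, 1]]
```

The grid has the same number of cells whether the input holds four points or four billion, which is why the rendering cost of the final image does not grow with the dataset.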

It supports various input formats, including CSV and Parquet, and can output visualizations as PNG images or aggregated data in NetCDF format. The CLI offers subcommands for different data types (e.g., points, rasters, polygons, lines), allowing users to specify aggregation methods, colormaps, output dimensions, and spatial extents directly from the terminal.

CAVEATS

datashader_cli is designed for large datasets, but very large files or complex operations can still demand substantial time and memory. For better performance, especially with distributed data, prefer Parquet input and consider Dask for out-of-core processing. The CLI exposes only a subset of the options available in the Python Datashader API, so some advanced customization requires scripting in Python.
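
Out-of-core processing works for this kind of workload because count grids are additive: each chunk can be aggregated on its own and the partial grids summed. The sketch below illustrates that idea in plain Python, assuming coordinates already scaled to the unit square; it does not use Dask, and the helper names are hypothetical.

```python
def count_chunk(chunk, width, height):
    """Count points per grid cell for one chunk.

    Assumes x and y are already scaled to [0, 1]. Hypothetical helper.
    """
    grid = [[0] * width for _ in range(height)]
    for x, y in chunk:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        grid[row][col] += 1
    return grid


def merge(a, b):
    """Add two count grids of the same shape cell by cell."""
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]


def chunks(seq, n):
    """Yield successive slices of at most n items."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


points = [(0.1, 0.1), (0.6, 0.2), (0.1, 0.8), (0.12, 0.05), (0.9, 0.9)]

# Process two points at a time, keeping only one small grid in memory.
total = [[0] * 2 for _ in range(2)]
for chunk in chunks(points, 2):
    total = merge(total, count_chunk(chunk, 2, 2))

print(total)                              # [[2, 1], [1, 1]]
print(total == count_chunk(points, 2, 2))  # True
```

The chunked result matches the single-pass result, so memory use depends on the chunk size and grid size rather than on the full dataset.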

INPUT DATA FORMATS

Supports CSV and Parquet and, in some cases, NetCDF files. Geospatial formats such as GeoParquet, Shapefile, GeoJSON, and GeoPackage are read with --geo true. Input files need identifiable columns for the x and y coordinates and for any values used in aggregation.

OUTPUT IMAGE QUALITY

The resolution of the output PNG is set by the --width and --height parameters. Higher resolutions produce larger files but can reveal finer detail in the data.
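
To get a feel for how size grows with resolution, the following back-of-the-envelope sketch computes the uncompressed pixel data for two image sizes. It assumes 8-bit RGBA pixels (4 bytes each), which is an assumption rather than a documented property of the tool; actual PNG files are compressed and usually smaller.

```python
def uncompressed_rgba_bytes(width, height):
    """Bytes of raw pixel data for an 8-bit RGBA image (assumed pixel format)."""
    return width * height * 4


for w, h in [(800, 600), (1600, 1200)]:
    size = uncompressed_rgba_bytes(w, h)
    print(f"{w}x{h}: {size:,} bytes before compression")
# 800x600: 1,920,000 bytes before compression
# 1600x1200: 7,680,000 bytes before compression
```

Doubling both dimensions quadruples the pixel count, so the raw data grows fourfold.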

HISTORY

The datashader_cli was developed as part of the Datashader project, initiated by Anaconda Inc. to address the growing challenge of visualizing very large datasets. While Datashader primarily offers a Python API for programmatic use, the CLI was introduced to provide a more accessible entry point for users who need to quickly generate visualizations without writing Python code. It emerged to streamline common rasterization workflows, making it easier to integrate Datashader into shell scripts or data pipelines where a direct command-line interface is preferred.

SEE ALSO

Python libraries: datashader, bokeh, holoviews, pandas, dask, geopandas
