LinuxCommandLibrary

aws-glue

Manage AWS Glue resources

TLDR

List jobs

$ aws glue list-jobs

Start a job
$ aws glue start-job-run --job-name [job_name]

Start a workflow run
$ aws glue start-workflow-run --name [workflow_name]

List triggers
$ aws glue list-triggers

Start a trigger
$ aws glue start-trigger --name [trigger_name]

Create a dev endpoint
$ aws glue create-dev-endpoint --endpoint-name [name] --role-arn [role_arn_used_by_endpoint]

SYNOPSIS

aws glue [global-options] subcommand [subcommand-options]

PARAMETERS

--aws-access-key-id
    AWS access key ID.

--aws-secret-access-key
    AWS secret access key.

--aws-session-token
    AWS session token.

--ca-bundle
    CA bundle for SSL verification.

--cli-auto-prompt
    Automatically prompt for CLI input.

--cli-binary-format raw-in-base64-out|base64
    How binary input parameters are interpreted (CLI v2).

--cli-connect-timeout
    Connection timeout in seconds.

--cli-read-timeout
    Read timeout in seconds.

--debug
    Enable debug logging.

--endpoint-url
    Override service endpoint URL.

--max-items
    Maximum items to return.

--no-cli-pager
    Disable the CLI output pager.

--no-paginate
    Disable automatic pagination.

--no-sign-request
    Do not sign requests.

--output json|yaml|yaml-stream|text|table
    Output format.

--page-size
    Page size for paginated results.

--profile
    Named profile from credentials file.

--region
    AWS region (e.g., us-east-1).

--region-set
    List of regions to try.

--no-verify-ssl
    Disable SSL certificate verification.
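
These global options compose with any glue subcommand. A minimal sketch, assuming a hypothetical prod profile, that builds a fully qualified invocation; the command is echoed for inspection rather than executed, so drop the echo to run it against a real account:

```shell
# Placeholder profile and region; adjust to your environment.
PROFILE=prod
REGION=us-east-1

# Echoed rather than executed so the sketch has no side effects;
# remove `echo` to run (requires AWS CLI v2 and valid credentials).
echo aws glue list-jobs \
  --profile "$PROFILE" \
  --region "$REGION" \
  --output table \
  --max-items 10
```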

DESCRIPTION

The aws glue command is a subcommand of the AWS Command Line Interface (CLI) used to manage AWS Glue, a serverless data integration service for ETL (extract, transform, load) workloads. It enables automation of data cataloging, job orchestration, crawling, and schema discovery across data stores like Amazon S3, RDS, DynamoDB, and Redshift.

AWS Glue automatically discovers, catalogs, and cleans data, making it available for querying with Athena or analysis in Redshift Spectrum. The CLI supports creating and managing jobs (Python/Spark scripts), crawlers (schema inference), triggers, workflows, databases, tables, partitions, classifiers, and development endpoints.
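
A typical catalog-then-run sequence might look like the sketch below. The crawler, database, bucket, role, and job names are hypothetical, and each command is echoed rather than executed so the example stays side-effect-free (remove echo to run for real):

```shell
# Placeholder account ID and role name.
ROLE_ARN="arn:aws:iam::123456789012:role/GlueRole"

# 1. Create a crawler that infers table schemas from objects under an S3 prefix
echo aws glue create-crawler \
  --name sales-crawler \
  --role "$ROLE_ARN" \
  --database-name sales_db \
  --targets '{"S3Targets":[{"Path":"s3://my-bucket/raw/"}]}'

# 2. Run the crawler; discovered tables land in the Glue Data Catalog
echo aws glue start-crawler --name sales-crawler

# 3. Start an ETL job against the cataloged tables
echo aws glue start-job-run --job-name sales-etl
```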

The CLI also supports batch operations for efficiency (e.g., batch-create-partition), job monitoring with run history, and integration with AWS Lake Formation for governance. Output formats include JSON, YAML, text, and table; paginated results are fetched automatically unless --no-paginate is given.
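
For job monitoring, run history can be fetched with get-job-runs and trimmed with the pagination flags. A small sketch with a hypothetical job name (echoed, not executed; drop the echo to run it):

```shell
JOB=sales-etl   # hypothetical job name

# Latest runs, capped at 5 entries, rendered as a table
echo aws glue get-job-runs \
  --job-name "$JOB" \
  --max-items 5 \
  --output table
```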

Prerequisites: install AWS CLI v2 (curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip", unzip awscliv2.zip, then sudo ./aws/install), configure credentials with aws configure, and attach appropriate IAM policies, such as the AWSGlueServiceRole managed policy, to the calling identity.

Common use cases: ETL pipelines, data lake building, and ML feature stores. Glue scales serverlessly and is billed per DPU-hour.

CAVEATS

Requires the AWS CLI to be installed and configured, and the calling identity needs the relevant IAM permissions (e.g., glue:CreateJob, glue:StartJobRun). API rate limits apply to Glue operations. The command is non-interactive and intended for scripting.

COMMON SUBCOMMANDS

create-job, get-job, start-job-run, create-crawler, start-crawler, get-database, batch-create-partition, list-jobs.

EXAMPLE USAGE

aws glue create-job --job-name my-etl --role arn:aws:iam::123456789012:role/GlueRole --command Name=glueetl,ScriptLocation=s3://bucket/script.py
aws glue start-job-run --job-name my-etl
aws glue get-crawler --name my-crawler

HISTORY

AWS Glue launched in August 2017, with initial support in AWS CLI v1. AWS CLI v2 (generally available in 2020) reworked binary parameter handling and improved performance. The service continues to evolve: Glue 4.0 (2022) added Spark 3.3 and Ray support.

SEE ALSO

aws(1), aws s3(1), aws emr(1), aws athena(1)
