aws-glue
Manage serverless ETL jobs and data catalog
TLDR
Create a crawler to discover data schema
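A minimal sketch of this step, written as a shell function. The function name, the argument order, and the `AWS_CLI` override variable are assumptions made for illustration and do not come from this page; setting `AWS_CLI=echo` prints the commands instead of running them.

```shell
# Sketch only: create a crawler over one S3 prefix, then start it.
# Arguments (all supplied by the caller, none are defaults of the CLI):
#   $1 crawler name   $2 IAM role name or ARN
#   $3 Data Catalog database   $4 S3 path to crawl
# AWS_CLI is a hypothetical override for the CLI binary (e.g. AWS_CLI=echo).
glue_crawl() {
  "${AWS_CLI:-aws}" glue create-crawler \
    --name "$1" \
    --role "$2" \
    --database-name "$3" \
    --targets "{\"S3Targets\":[{\"Path\":\"$4\"}]}" &&
  "${AWS_CLI:-aws}" glue start-crawler --name "$1"
}
```

Example call, with placeholder values: glue_crawl sales-crawler AWSGlueServiceRole-demo sales_db s3://example-bucket/sales/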
SYNOPSIS
aws glue command [options]
DESCRIPTION
aws glue is the AWS CLI interface for AWS Glue, a serverless data integration service for ETL (extract, transform, load) workloads. Glue discovers, prepares, and combines data for analytics, machine learning, and application development.
Key components include the Data Catalog (central metadata repository), Crawlers (automatic schema discovery), Jobs (ETL scripts in Python or Scala), and Triggers (job orchestration). Glue integrates with S3, Redshift, RDS, and other data stores.
COMMANDS
create-crawler
Create a crawler for schema discovery
start-crawler
Run a crawler to populate the catalog
get-databases
List databases in the Data Catalog
get-tables
List tables in a database
get-table
Get schema details for a table
create-job
Create an ETL job
start-job-run
Execute a job
get-job-run
Check job run status
create-trigger
Create a job trigger
get-crawlers
List all crawlers
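The start-job-run and get-job-run commands are typically used together. The following is a rough sketch of starting a run and polling until it finishes; the function name and the `AWS_CLI` and `POLL_SECONDS` variables are assumptions for illustration, not part of the AWS CLI.

```shell
# Sketch only: start a Glue job run and poll its state until it settles.
#   $1 job name (supplied by the caller)
# AWS_CLI (CLI binary override) and POLL_SECONDS (poll interval) are
# hypothetical knobs introduced for this example.
run_glue_job() {
  job_name=$1
  # start-job-run returns a JobRunId, extracted here with --query.
  run_id=$("${AWS_CLI:-aws}" glue start-job-run --job-name "$job_name" \
    --query JobRunId --output text) || return 1
  while :; do
    state=$("${AWS_CLI:-aws}" glue get-job-run --job-name "$job_name" \
      --run-id "$run_id" --query JobRun.JobRunState --output text) || return 1
    case $state in
      # States that mean the run has not settled yet: keep waiting.
      STARTING|RUNNING|STOPPING|WAITING) sleep "${POLL_SECONDS:-30}" ;;
      *) break ;;
    esac
  done
  echo "$state"
  # Exit status is zero only when the run succeeded.
  [ "$state" = SUCCEEDED ]
}
```

Example call, with a placeholder job name: run_glue_job nightly-etl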
CAVEATS
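Crawlers can take a long time to run on large datasets. Job cold starts add latency to each run. Use job bookmarks for incremental processing, so that reruns skip data that has already been processed. Jobs are billed by DPU (Data Processing Unit) usage for the duration of each run, so long-running jobs accumulate cost. The Data Catalog has a limit of 200,000 tables per database.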
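As a hedged illustration of the job bookmark advice, bookmarks can be enabled for a single run through the `--job-bookmark-option` job argument. The function name and the `AWS_CLI` override are assumptions for this sketch. The job script itself must also initialize and commit the job for bookmark state to advance, which is not shown here.

```shell
# Sketch only: start a run with job bookmarks enabled.
#   $1 job name (supplied by the caller)
# AWS_CLI is a hypothetical override for the CLI binary (e.g. AWS_CLI=echo).
start_incremental_run() {
  "${AWS_CLI:-aws}" glue start-job-run \
    --job-name "$1" \
    --arguments '{"--job-bookmark-option":"job-bookmark-enable"}'
}
```

Example call, with a placeholder job name: start_incremental_run nightly-etl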
HISTORY
AWS Glue launched in August 2017 as a serverless ETL service. Glue Studio for visual ETL authoring came in 2020. Data Quality features were added in 2022, and Glue for Ray (distributed Python) launched in 2023 for data science workloads.
SEE ALSO
aws(1), aws-athena(1), aws-s3(1), aws-redshift(1)
