What is AWS Glue?

AWS Glue is Amazon's serverless data-integration service. It bundles three things that usually live apart: the Glue Data Catalog, a managed catalog compatible with the Hive metastore¹; crawlers that scan storage and infer schemas; and serverless Spark² and Python jobs that run ETL³ without you provisioning a cluster. You point it at data sitting in a data lake⁴ on S3, and it gives you a queryable catalog and a place to run transformations, with no servers to patch and no metastore to babysit.

🔗 Learn more — ¹ What is the Hive metastore?

🔗 Learn more — ² What is Apache Spark?

🔗 Learn more — ³ What is ETL (and how is ELT different)?

🔗 Learn more — ⁴ What is a data lake?

The Data Catalog is the real backbone

If you take one thing from Glue, take the Data Catalog. It is a central store of table definitions: schemas, column types, partitions, and the S3 locations behind them. Because it is compatible with the Hive metastore, it acts as the shared source of truth for the rest of the AWS analytics stack. Athena reads it to know what tables exist when you write SQL against S3. Redshift⁵ Spectrum uses it to query external tables. EMR can point Spark or Hive at it instead of running its own metastore. Register a table once, and every engine sees the same definition.

🔗 Learn more — ⁵ What is Amazon Redshift?
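To make "register a table once" concrete, here is a minimal sketch of what a table definition looks like when you write it yourself instead of letting a crawler guess. It builds the `TableInput` payload that Glue's `create_table` call expects for a Parquet table. The table name, bucket, columns, and the `build_table_input` and `register_table` helpers are all hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from typing import Any

# Standard Hive class names for Parquet-backed tables.
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
PARQUET_INPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
PARQUET_OUTPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"


def build_table_input(
    name: str,
    location: str,
    columns: list[tuple[str, str]],
    partition_keys: list[tuple[str, str]],
) -> dict[str, Any]:
    """Build the TableInput payload for an external Parquet table (hypothetical helper)."""

    def as_columns(pairs: list[tuple[str, str]]) -> list[dict[str, str]]:
        return [{"Name": col_name, "Type": col_type} for col_name, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": as_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": as_columns(columns),
            "Location": location,
            "InputFormat": PARQUET_INPUT,
            "OutputFormat": PARQUET_OUTPUT,
            "SerdeInfo": {"SerializationLibrary": PARQUET_SERDE},
        },
    }


def register_table(database: str, table_input: dict[str, Any]) -> None:
    """Write the definition to the Data Catalog. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("glue").create_table(DatabaseName=database, TableInput=table_input)


# Assumed example table: the names and S3 path are placeholders.
orders = build_table_input(
    name="orders",
    location="s3://example-lake/orders/",
    columns=[("order_id", "bigint"), ("amount", "double")],
    partition_keys=[("dt", "string")],
)
```

Calling `register_table("analytics", orders)` against a real account would create the table, after which Athena, Redshift Spectrum, and EMR would all read this one definition. The database name `analytics` is, again, just a placeholder.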

This is why, in practice, most teams use the catalog far more than they use Glue's ETL. A data catalog that every query engine⁶ already trusts is genuinely useful infrastructure, and it removes the chore of keeping separate metastores in sync. The lock-in is real but mild here: the catalog is the thing you would miss most if you left AWS, and also the thing easiest to live with.

🔗 Learn more — ⁶ What is a query engine (Trino, Presto, and friends)?

Crawlers, the convenient part that bites

Crawlers are Glue's schema-inference robots. You aim one at an S3 prefix and it walks the files, guesses column types, detects partitions, and writes table definitions into the catalog. For a tidy data lake of well-formed Parquet⁷, this is a genuine time-saver: schemas appear, partitions register, and Athena can query immediately.

🔗 Learn more — ⁷ How Parquet works: columnar storage explained

The honest caveat: crawlers misbehave on messy data. They infer types you did not want, split one logical table into several when file layouts drift, fight you over partition detection, and occasionally rewrite a schema you had carefully fixed by hand. Many seasoned teams run a crawler once to bootstrap a table, then disable it and manage the schema explicitly, because a crawler that silently changes your catalog is worse than no crawler at all.
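If you do keep a crawler around, you can at least ask it to be less destructive. The sketch below builds arguments for Glue's `create_crawler` call with a schema change policy of `LOG`, so detected changes are logged rather than written over the table, and a grouping policy that asks the crawler to combine compatible schemas instead of splitting them into separate tables. The crawler name, role ARN, database, and path are placeholders, `build_crawler_args` is a hypothetical helper, and whether these settings tame your particular data is something to verify rather than assume.

```python
import json
from typing import Any, Dict


def build_crawler_args(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Build keyword arguments for glue.create_crawler (hypothetical helper)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Log schema changes and deletions instead of applying them to the catalog.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
        # Crawler configuration is passed as a JSON string.
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            }
        ),
        # No "Schedule" key: the crawler only runs when started on demand.
    }


# Assumed example values: every identifier here is a placeholder.
crawler_args = build_crawler_args(
    name="orders-bootstrap",
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",
    database="analytics",
    s3_path="s3://example-lake/orders/",
)

# With boto3 installed and credentials configured, this would create the crawler:
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_args)
```

Leaving out a schedule matches the "run once to bootstrap" habit described above: the crawler exists, but it only touches the catalog when someone starts it deliberately.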

Serverless Spark and Python jobs

Glue's compute side runs ETL as managed Apache Spark. You write a job in PySpark or Scala, or use Glue Studio's visual editor to build a transformation graph without code, and Glue allocates the workers, runs it, and tears the cluster down. There is also a lighter Python-shell⁸ job type for small tasks that do not need a distributed engine. The appeal is obvious: no cluster to size, no Spark version to maintain, billing by the second of compute used.

🔗 Learn more — ⁸ What is a shell?
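The difference between the two job types shows up clearly in how you define them. This sketch builds arguments for Glue's `create_job` call in both flavours: a Spark job (`glueetl`) sized by worker type and count, and a Python-shell job (`pythonshell`) sized by a fractional `MaxCapacity`. The job names, role ARN, script paths, Glue version, and worker settings are assumptions picked for illustration, and `build_job_args` is a hypothetical helper.

```python
from typing import Any, Dict


def build_job_args(
    name: str,
    role_arn: str,
    script_location: str,
    distributed: bool = True,
    workers: int = 2,
) -> Dict[str, Any]:
    """Build keyword arguments for glue.create_job (hypothetical helper)."""
    args: Dict[str, Any] = {"Name": name, "Role": role_arn}
    if distributed:
        # Spark job: capacity is expressed as a worker type and a worker count.
        args["Command"] = {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        }
        args["GlueVersion"] = "4.0"  # assumed version; use whatever your account runs
        args["WorkerType"] = "G.1X"
        args["NumberOfWorkers"] = workers
    else:
        # Python-shell job: a single small process, sized by MaxCapacity in DPUs.
        args["Command"] = {
            "Name": "pythonshell",
            "ScriptLocation": script_location,
            "PythonVersion": "3.9",
        }
        args["MaxCapacity"] = 0.0625
    return args


# Assumed example values: every identifier here is a placeholder.
ROLE = "arn:aws:iam::123456789012:role/example-glue-role"
spark_job = build_job_args("orders-etl", ROLE, "s3://example-lake/scripts/orders_etl.py")
shell_job = build_job_args(
    "orders-housekeeping",
    ROLE,
    "s3://example-lake/scripts/housekeeping.py",
    distributed=False,
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("glue").create_job(**spark_job)
```

The helper never sets `MaxCapacity` alongside `WorkerType` and `NumberOfWorkers`, because the API treats those as alternative ways of sizing a job.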

flowchart TD
    S3["Data lake on S3"] --> CR["Crawler infers schema"]
    CR --> CAT["Glue Data Catalog"]
    CAT --> ATH["Athena / Redshift Spectrum / EMR"]
    JOB["Serverless Spark / Python job"] --> S3
    CAT --> JOB

    classDef core stroke:#a3be8c,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    %% color = green: the catalog, the part most people actually use; grey: everything else
    class CAT core
    class S3,CR,ATH,JOB plain

The honest cost picture: serverless Spark is convenient, not cheap or instant. Jobs carry cold-start latency while Glue spins up workers, so a small job can spend more time starting than running, and the per-DPU-hour pricing adds up for heavy or frequent workloads. For steady high-volume ETL, a dedicated cluster or a non-Spark engine can be cheaper and faster. Glue earns its keep when your jobs are intermittent and you value not running infrastructure over squeezing out the last dollar.
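A few lines of arithmetic make the pricing easier to reason about. The sketch below estimates cost as DPUs times billed hours times a price per DPU-hour. The default price of 0.44 USD and the 60-second minimum billing window are assumptions for illustration; check the current pricing page for your region and Glue version before trusting the numbers. The function names and the two example workloads are hypothetical.

```python
def billed_dpu_hours(dpus: float, runtime_seconds: float, min_billed_seconds: float = 60) -> float:
    """DPU-hours charged for one run, applying an assumed minimum billing window."""
    billed_seconds = max(runtime_seconds, min_billed_seconds)
    return dpus * billed_seconds / 3600


def run_cost(
    dpus: float,
    runtime_seconds: float,
    price_per_dpu_hour: float = 0.44,  # assumed price, not a quoted one
    min_billed_seconds: float = 60,
) -> float:
    """Estimated cost of a single job run."""
    return billed_dpu_hours(dpus, runtime_seconds, min_billed_seconds) * price_per_dpu_hour


def monthly_cost(
    dpus: float,
    runtime_seconds: float,
    runs_per_day: int,
    days: int = 30,
    price_per_dpu_hour: float = 0.44,
) -> float:
    """Estimated monthly cost for a job that runs on a fixed daily cadence."""
    return run_cost(dpus, runtime_seconds, price_per_dpu_hour) * runs_per_day * days


# Two made-up workloads: a tiny job every 15 minutes and one heavier nightly job.
small_frequent = monthly_cost(dpus=2, runtime_seconds=20, runs_per_day=96)
nightly_batch = monthly_cost(dpus=10, runtime_seconds=1800, runs_per_day=1)

print(f"small, frequent job: ~${small_frequent:.2f} per month")
print(f"nightly batch job:   ~${nightly_batch:.2f} per month")
```

Under these assumptions the 20-second job is billed as if it ran for a full minute every time, which is how many small runs quietly add up. Plugging in your own runtimes and run counts is a quick way to compare Glue against a dedicated cluster before committing.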

Where Glue fits, fairly

Glue is convenient if you are already all-in on AWS. It removes catalog and cluster operations, plugs straight into Athena, Redshift, and EMR, and lets a small team stand up a working data lake quickly. Those are real wins. The flip side is equally real: it ties you to AWS, the crawlers need supervision, and serverless Spark trades cost and latency for operational ease. The pragmatic read is that the Data Catalog is the durable, broadly useful piece, the crawlers are a starting convenience to outgrow, and the serverless jobs are fine for bursty work but not automatically the cheapest way to run ETL.

The short version: AWS Glue is a serverless catalog plus serverless Spark glued to S3. Lean on the catalog, distrust the crawlers, and price the Spark jobs before betting your pipeline on them.
