AWS Glue

Overview

Serverless data integration service for ETL and data cataloging.

Components

graph TD
    A[Data Sources] --> B[Crawlers]
    B --> C[Data Catalog]
    C --> D[ETL Jobs]
    D --> E[Target Data Store]

    F[DataBrew] --> E
    G[Data Quality] --> C

AWS Glue Data Catalog

Centralized metadata repository.

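| Component | Description |
|-----------|-------------|
| Databases | Logical grouping of tables |
| Tables | Metadata definition of a dataset (schema, location, format) |
| Partitions | Subdivision of a table's data by key values, such as date |
| Connections | Connection properties and credentials for data sources |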
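
To make the database/table/column hierarchy concrete, here is a minimal sketch that lists the columns of every table in one database. The helper name `list_table_columns` is hypothetical; it assumes `glue` is a boto3 Glue client (for example `boto3.client("glue")`) and relies on the `get_tables` paginator.

```python
def list_table_columns(glue, database_name):
    """Return {table_name: [(column_name, column_type), ...]} for one database.

    `glue` is assumed to be a boto3 Glue client.
    """
    paginator = glue.get_paginator("get_tables")
    tables = {}
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            # StorageDescriptor can be missing on some tables, so default to empty.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [(c["Name"], c.get("Type", "")) for c in columns]
    return tables
```

Passing the client in as an argument keeps the helper easy to exercise with a stub instead of a live AWS account.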

Crawlers

Scan data stores, infer schemas automatically, and create or update tables in the Data Catalog.

import boto3

glue = boto3.client("glue")

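glue.create_crawler(
    Name="my-crawler",
    Role="AWSGlueServiceRole",  # name or ARN of an existing IAM role that Glue can assume
    DatabaseName="my-database",  # catalog database that receives the discovered tables
    Targets={
        "S3Targets": [{"Path": "s3://my-bucket/data/"}]
    },
    Schedule="cron(0 0 * * ? *)"  # daily at 00:00 UTC
)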
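
A crawler can also be run on demand rather than on its schedule. The sketch below starts one and polls until it returns to the `READY` state. The helper name `run_crawler_and_wait` and the polling defaults are assumptions, and `glue` is assumed to be a boto3 Glue client.

```python
import time


def run_crawler_and_wait(glue, name, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Start a crawler and block until it is READY again.

    Returns the status of the most recent crawl, if the service reports one.
    """
    glue.start_crawler(Name=name)
    waited = 0
    while waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        crawler = glue.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl describes the outcome of the run that just finished.
            return crawler.get("LastCrawl", {}).get("Status")
    raise TimeoutError(f"crawler {name!r} still running after {timeout_seconds}s")
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use the default is fine.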

ETL Jobs

Data transformation scripts, most commonly Spark-based.

# Glue ETL script
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()  # reuse an existing Spark context if one is already running
glueContext = GlueContext(sc)

# Read from catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table"
)

# Transform: each tuple is (source column, source type, target column, target type)
transformed = datasource.apply_mapping([
    ("old_col", "string", "new_col", "string")
])

# Write
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://output/"},
    format="parquet"
)
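
The mapping tuples are easier to read with a local analogue. The sketch below is plain Python over a list of dicts, not Glue's implementation: the helper `apply_mapping_rows` and its small cast table are assumptions for illustration. It mirrors two behaviors of the mapping step: fields are renamed and cast to the target type, and fields that appear in no mapping are dropped.

```python
def apply_mapping_rows(rows, mappings):
    """Rename and cast fields of dict records according to 4-tuples of
    (source_name, source_type, target_name, target_type).

    Illustrative only: supports a handful of type names.
    """
    casts = {"string": str, "int": int, "long": int, "double": float}
    result = []
    for row in rows:
        mapped = {}
        for source, _source_type, target, target_type in mappings:
            if row.get(source) is not None:
                mapped[target] = casts[target_type](row[source])
        # Fields not named in any mapping are intentionally left out.
        result.append(mapped)
    return result


rows = [{"old_col": 5, "extra": "ignored"}]
print(apply_mapping_rows(rows, [("old_col", "string", "new_col", "string")]))
```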

AWS Glue DataBrew

Visual data preparation.

| Feature | Description |
|---------|-------------|
| Datasets | Data sources |
| Projects | Interactive preparation |
| Recipes | Transformation steps |
| Jobs | Execute recipes |

AWS Glue Data Quality

Automated data quality checks.
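
Rules are written in DQDL (Data Quality Definition Language) as a `Rules = [...]` block. As a rough sketch, a ruleset can be assembled from individual rule strings and registered against a catalog table. The helper names `build_ruleset` and `register_ruleset` are hypothetical, and `glue` is assumed to be a boto3 Glue client exposing `create_data_quality_ruleset`.

```python
def build_ruleset(rules):
    """Join individual DQDL rule strings into a `Rules = [...]` block."""
    if not rules:
        raise ValueError("a ruleset needs at least one rule")
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


def register_ruleset(glue, name, rules, database_name, table_name):
    """Store the ruleset in Glue, attached to one Data Catalog table."""
    return glue.create_data_quality_ruleset(
        Name=name,
        Ruleset=build_ruleset(rules),
        TargetTable={"DatabaseName": database_name, "TableName": table_name},
    )
```

Keeping the string assembly separate from the API call makes the DQDL text easy to inspect before anything is sent to AWS.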

# Data quality rules in DQDL (Data Quality Definition Language)
rules = """
Rules = [
    IsComplete "column_name",
    Uniqueness "id" > 0.99,
    ColumnValues "age" between 0 and 120
]
"""

Job Types

| Type | Use Case |
|------|----------|
| Spark | Large-scale ETL |
| Python Shell | Simple scripts |
| Streaming | Real-time ETL |
| Ray | Distributed Python |
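
In the API, the job type is selected through the `Name` field of the `Command` argument to `create_job`. The sketch below maps the four types above to those command names. The dictionary keys and the helper `job_command` are hypothetical, and other settings a real job needs (worker type, Glue version, and a runtime for Ray) are omitted.

```python
# Command names understood by the Glue CreateJob API, keyed by
# illustrative job-type labels chosen for this example.
COMMAND_NAMES = {
    "spark": "glueetl",
    "python_shell": "pythonshell",
    "streaming": "gluestreaming",
    "ray": "glueray",
}


def job_command(job_type, script_location):
    """Build the Command argument for glue.create_job."""
    try:
        name = COMMAND_NAMES[job_type]
    except KeyError:
        raise ValueError(f"unknown job type: {job_type!r}") from None
    return {"Name": name, "ScriptLocation": script_location}
```

A call would then look like `glue.create_job(Name="my-job", Role="AWSGlueServiceRole", Command=job_command("spark", "s3://my-bucket/scripts/job.py"))`, where the job name and script path are placeholders.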

Exam Focus Areas

!!! warning "Key Topics"
    - Crawlers for schema discovery
    - Data Catalog for metadata
    - ETL jobs for transformation
    - DataBrew for visual preparation
    - Data Quality for validation