Data Transformation

Overview

Data transformation converts raw data into formats suitable for ML model training.

Key Services

AWS Glue ETL

Serverless Apache Spark for data transformation.

# Example Glue ETL script
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Glue passes JOB_NAME as a job argument at run time
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="mytable"
)

# Apply transformations: (source column, source type, target column, target type)
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("old_col", "string", "new_col", "string"),
    ]
)

# Write to S3
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://bucket/output/"},
    format="parquet"
)

# Mark the job run as complete (needed for job bookmarks to advance)
job.commit()
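
The mapping tuples in the script read as (source column, source type, target column, target type). The sketch below mimics that behavior in plain Python to show what the tuples express, including that columns not named in any tuple do not appear in the output. It is an illustration only and not the Glue API: `apply_mapping`, `CASTS`, and the sample rows are hypothetical names made up for this example, and the handling of missing values is a choice of the sketch.

```python
# Pure-Python illustration of what a mapping tuple expresses; this is NOT the Glue API.
CASTS = {"string": str, "int": int, "double": float}


def apply_mapping(records, mappings):
    """Rename and cast fields per (source, source_type, target, target_type) tuples.

    Fields that no tuple mentions are left out of the output rows.
    """
    out = []
    for record in records:
        row = {}
        for source, _source_type, target, target_type in mappings:
            value = record.get(source)
            # Sketch choice: missing or null values stay None instead of being cast.
            row[target] = None if value is None else CASTS[target_type](value)
        out.append(row)
    return out


rows = [{"old_col": "a", "extra": 1}, {"old_col": "b", "extra": 2}]
renamed = apply_mapping(rows, [("old_col", "string", "new_col", "string")])
print(renamed)  # [{'new_col': 'a'}, {'new_col': 'b'}]
```

Note that the "extra" column is gone from the result, so every column that should survive needs its own tuple, even when its name and type are unchanged.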

AWS Glue DataBrew

Visual data preparation tool for non-coders.

  • 250+ built-in transformations
  • Profile data quality
  • Recipe-based transformations

Amazon EMR

Managed Hadoop/Spark for large-scale processing.

Service     When to Use
Glue        Serverless, simple ETL
EMR         Complex processing, custom libraries
DataBrew    Visual, no-code preparation

SageMaker Processing

Run data processing jobs with custom containers.

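from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

# role (IAM role ARN) and image_uri (processing container image) are assumed
# to be defined elsewhere for your account.
processor = ScriptProcessor(
    role=role,
    image_uri=image_uri,
    command=["python3"],  # interpreter used to run the script in the container
    instance_count=1,
    instance_type="ml.m5.xlarge"
)

# Inputs are downloaded from S3 to the container path before the script runs;
# files written to the output path are uploaded to S3 when it finishes.
processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source="s3://input/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://output/")]
)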
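
The job above runs preprocessing.py inside the container, where the script sees its input and output only as local directories. A minimal sketch of what such a script might look like follows, assuming the input is a set of CSV files with a header row and that the only cleaning step is dropping rows with empty fields. The function name `preprocess` and the cleaning rule are assumptions made for illustration; the source does not show the real script.

```python
import csv
from pathlib import Path

# Defaults match the container paths used in the run() call above.
INPUT_DIR = Path("/opt/ml/processing/input")
OUTPUT_DIR = Path("/opt/ml/processing/output")


def preprocess(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR):
    """Copy each CSV from input_dir to output_dir, dropping rows with empty fields.

    Returns the number of data rows kept across all files.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    kept = 0
    for src in sorted(input_dir.glob("*.csv")):
        with src.open(newline="") as fin, (output_dir / src.name).open("w", newline="") as fout:
            reader = csv.reader(fin)
            writer = csv.writer(fout)
            header = next(reader, None)
            if header is None:
                continue  # empty file: nothing to copy
            writer.writerow(header)
            for row in reader:
                # Keep only non-blank rows in which every field has a value.
                if row and all(field.strip() for field in row):
                    writer.writerow(row)
                    kept += 1
    return kept


# Run with the default paths only when the SageMaker input directory exists.
if __name__ == "__main__" and INPUT_DIR.is_dir():
    print(f"rows kept: {preprocess()}")
```

Taking the directories as parameters keeps the script testable on a laptop: call `preprocess("sample_in", "sample_out")` against local folders before submitting the job.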

Common Transformations

  • Handling missing values
  • Encoding categorical variables (one-hot, label encoding)
  • Feature scaling (normalization, standardization)
  • Text tokenization and vectorization
  • Image resizing and augmentation
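
Three of the transformations listed above (missing values, one-hot encoding, feature scaling) can be sketched in plain Python. This is a minimal illustration using only the standard library, with made-up function names and sample data. In practice a library such as pandas or scikit-learn would typically do this work, and the statistics should be computed on the training split only.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode categories as 0/1 vectors, with columns in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def standardize(values):
    """Standardization: shift to zero mean, scale to unit (population) std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Normalization: rescale linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = impute_mean([20.0, None, 40.0])    # [20.0, 30.0, 40.0]
colors = one_hot(["red", "blue", "red"])  # [[0, 1], [1, 0], [0, 1]]
scaled = min_max(ages)                    # [0.0, 0.5, 1.0]
z_scores = standardize(ages)              # mean 0, population std dev 1
```

Both scaling functions divide by the spread of the column, so a constant column (all values equal) would raise ZeroDivisionError and needs to be dropped or handled separately.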