
# SageMaker Training

## Overview

Amazon SageMaker provides fully managed infrastructure for training ML models: you supply a training image or framework script, data in S3, and a compute configuration, and SageMaker provisions the instances, runs the job, uploads the artifacts, and tears the cluster down.

## Training Job Components

A training job is defined by an `Estimator`: the container image to run, an IAM execution role, the compute resources, an S3 output location, and hyperparameters.

```python
from sagemaker.estimator import Estimator

# training_image and role are assumed to be defined earlier
# (an ECR image URI and an IAM execution role ARN)
estimator = Estimator(
    image_uri=training_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://bucket/output/",
    hyperparameters={
        "epochs": 10,
        "learning_rate": 0.01,
    },
)

# Each key becomes a named input channel for the job
estimator.fit({
    "train": "s3://bucket/train/",
    "validation": "s3://bucket/validation/",
})
```
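
Inside the container, each channel passed to `fit` is mounted under `/opt/ml/input/data/<channel_name>`, and framework (script-mode) containers also expose the paths as environment variables. A minimal sketch of the script side, assuming a script-mode container and the hyperparameter names from the estimator above (the file name `train.py` is illustrative):

```python
# train.py -- minimal script-mode skeleton (illustrative)
import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    # SageMaker sets one SM_CHANNEL_* variable per input channel
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))
    args = parser.parse_args()

    # ... load data from args.train / args.validation, train the model,
    # and write the final artifacts to args.model_dir so SageMaker
    # uploads them to output_path ...
```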

## Instance Types

| Category          | Instance Types     | Use Case         |
|-------------------|--------------------|------------------|
| General Purpose   | ml.m5.*            | Balanced compute |
| Compute Optimized | ml.c5.*            | CPU-intensive    |
| GPU               | ml.p3.*, ml.g4dn.* | Deep learning    |
| Memory Optimized  | ml.r5.*            | Large datasets   |

## Training Modes

### File Mode

- Data is downloaded to the instance before training starts
- Best for iterative access
- Default mode

### Pipe Mode

- Data is streamed from S3 as training runs
- No download wait time
- Best for large datasets

### FastFile Mode

- POSIX-compliant file access backed by S3
- Lazy loading: file contents are fetched on read
- Best for random access patterns
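
The input mode can be set per channel with `TrainingInput` (or for the whole job via the estimator's `input_mode` parameter). A sketch, reusing the bucket paths and `estimator` from the earlier example:

```python
from sagemaker.inputs import TrainingInput

# Stream the large training set; lazily load validation files on demand
train_input = TrainingInput("s3://bucket/train/", input_mode="Pipe")
validation_input = TrainingInput("s3://bucket/validation/", input_mode="FastFile")

estimator.fit({"train": train_input, "validation": validation_input})
```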

## Distributed Training

### Data Parallelism

Each instance holds a full copy of the model; the training data is sharded across instances, and gradients are synchronized after every step.

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role=role,                 # IAM execution role, as above
    framework_version="1.13",  # example version; use one the library supports
    py_version="py39",
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
```
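
On the script side, the SageMaker data parallelism library plugs into PyTorch DDP as a process-group backend. A sketch of the relevant lines in `train.py`, assuming a recent library version where the backend is named `smddp`:

```python
# train.py (excerpt) -- enable SageMaker data parallelism in PyTorch DDP
import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend

dist.init_process_group(backend="smddp")  # launched with one process per GPU

local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda()  # stand-in for the real model
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# Gradients are now all-reduced across all instances on each backward pass;
# shard the dataset per rank, e.g. with torch.utils.data.DistributedSampler.
```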

### Model Parallelism

Split the model itself across devices when it is too large to fit in a single GPU's memory.

```python
# The model parallel library runs on top of MPI, so "mpi" is enabled as well
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {"partitions": 2},  # split the model into 2 partitions
        }
    },
    "mpi": {"enabled": True},
}
```

## Spot Training

Managed Spot Training can cut training costs by up to 90% compared with On-Demand instances. Spot capacity can be interrupted at any time, so enable checkpointing so the job can resume where it left off.

```python
estimator = Estimator(
    # ... image_uri, role, instance settings as shown above ...
    use_spot_instances=True,
    max_wait=3600,  # max total time (s), including interruptions; must be >= max_run
    max_run=1800,   # max training time (s)
    checkpoint_s3_uri="s3://bucket/checkpoints/",
)
```
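
SageMaker syncs `checkpoint_s3_uri` with `/opt/ml/checkpoints` inside the container (the default `checkpoint_local_path`), so the script only has to save there and reload on startup. A minimal sketch in PyTorch (helper names are illustrative):

```python
import os
import torch

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # synced with checkpoint_s3_uri
CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, "latest.pt")

def save_checkpoint(model, optimizer, epoch):
    """Persist training state; SageMaker uploads it to S3."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CHECKPOINT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Resume after a Spot interruption; return the next epoch to run."""
    if not os.path.exists(CHECKPOINT_PATH):
        return 0  # no checkpoint yet: start from scratch
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```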

## Exam Tips

!!! warning "Key Points"

    - Use Pipe Mode for large datasets
    - Use Spot Instances with checkpointing for cost savings
    - Data Parallelism for scaling with batch size
    - Model Parallelism for models that don't fit in memory