Skip to content

Lab 02: SageMaker Training Job

Domain: 2 - Model Development
Difficulty: Medium
Time: 45 minutes

Objective

Train a model using SageMaker built-in XGBoost algorithm.

Prerequisites

  • AWS CLI configured
  • Python with boto3 and sagemaker SDK
  • S3 bucket for data and output

Steps

Step 1: Prepare Data

import boto3
import sagemaker
import pandas as pd
from sklearn.model_selection import train_test_split

# Download sample data
!wget https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/main/introduction_to_amazon_algorithms/xgboost_abalone/abalone.csv

# Load and prepare
df = pd.read_csv('abalone.csv', header=None)

# Split data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)

# Save to CSV (target column first for XGBoost)
train_df.to_csv('train.csv', index=False, header=False)
val_df.to_csv('validation.csv', index=False, header=False)
test_df.to_csv('test.csv', index=False, header=False)

Step 2: Upload to S3

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'xgboost-abalone'

train_path = sagemaker_session.upload_data('train.csv', bucket=bucket, key_prefix=f'{prefix}/train')
val_path = sagemaker_session.upload_data('validation.csv', bucket=bucket, key_prefix=f'{prefix}/validation')

Step 3: Configure Training Job

from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Get XGBoost container
region = sagemaker_session.boto_region_name
xgb_image = image_uris.retrieve('xgboost', region, version='1.5-1')

# Get execution role
role = sagemaker.get_execution_role()

# Create estimator
xgb_estimator = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=sagemaker_session
)

# Set hyperparameters
xgb_estimator.set_hyperparameters(
    objective='reg:squarederror',
    num_round=100,
    max_depth=5,
    eta=0.1,
    subsample=0.8,
    colsample_bytree=0.8
)

Step 4: Start Training

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(train_path, content_type='text/csv')
val_input = TrainingInput(val_path, content_type='text/csv')

xgb_estimator.fit({
    'train': train_input,
    'validation': val_input
})

Step 5: Verify Training

# Check training job status
training_job_name = xgb_estimator.latest_training_job.name
print(f"Training job: {training_job_name}")

# Model artifact location
model_artifact = xgb_estimator.model_data
print(f"Model artifact: {model_artifact}")

Verification

  • Training job completed successfully
  • Model artifact saved to S3
  • Training metrics available in CloudWatch

Cleanup

# Delete training artifacts (optional)
import boto3
s3 = boto3.resource('s3')
bucket_obj = s3.Bucket(bucket)
bucket_obj.objects.filter(Prefix=prefix).delete()

Key Takeaways

!!! note "Exam Points" - XGBoost expects target column first - Use TrainingInput for data channels - Model artifacts saved to S3 automatically - Training logs available in CloudWatch