SageMaker Endpoints

Overview

SageMaker endpoints serve deployed models for inference, in real-time, serverless, asynchronous, or multi-model form.

Endpoint Types

Real-time Inference

Always-on endpoints for low-latency predictions.

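from sagemaker.model import Model
from sagemaker.predictor import Predictor

# inference_image (serving container URI) and role (IAM role ARN)
# are placeholders that must be defined beforehand.
model = Model(
    image_uri=inference_image,
    model_data="s3://bucket/model.tar.gz",
    role=role,
    predictor_cls=Predictor,  # without this, deploy() returns None
)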

predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.m5.large",
    endpoint_name="my-endpoint"
)

# Invoke
response = predictor.predict(data)

Serverless Inference

Pay-per-request with automatic scaling.

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10
)

predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="serverless-endpoint"
)
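
Serverless settings are constrained by the service, so it can help to check them before deploying. The sketch below assumes the limits as documented at the time of writing (memory in 1 GB steps from 1024 to 6144 MB, concurrency from 1 to 200 per endpoint). The function and constant names are hypothetical; confirm the numbers against current quotas.

```python
# Assumed limits; verify against the current SageMaker quotas.
VALID_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_LIMIT = 200


def validate_serverless_settings(memory_size_in_mb: int, max_concurrency: int) -> None:
    """Raise ValueError if the settings fall outside the assumed limits."""
    if memory_size_in_mb not in VALID_MEMORY_SIZES_MB:
        raise ValueError(
            f"memory_size_in_mb must be one of {VALID_MEMORY_SIZES_MB}, "
            f"got {memory_size_in_mb}"
        )
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        raise ValueError(
            f"max_concurrency must be between 1 and {MAX_CONCURRENCY_LIMIT}, "
            f"got {max_concurrency}"
        )


# The values used in the example above pass the check.
validate_serverless_settings(memory_size_in_mb=2048, max_concurrency=10)
```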

Asynchronous Inference

For large payloads and long processing times.

from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://bucket/async-output/",
    max_concurrent_invocations_per_instance=4,
    notification_config={
        "SuccessTopic": success_topic_arn,
        "ErrorTopic": error_topic_arn
    }
)

predictor = model.deploy(
    instance_type="ml.m5.large",
    initial_instance_count=1,
    async_inference_config=async_config
)
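
An async endpoint is invoked with a pointer to the input in S3 rather than with the payload itself, and the result is written under the configured output path. A minimal sketch of that call follows, assuming a boto3 `sagemaker-runtime` client is passed in. The helper name `submit_async_request` is hypothetical, and the client is injected so the function can be exercised without AWS access.

```python
def submit_async_request(runtime_client, endpoint_name: str, input_s3_uri: str,
                         content_type: str = "application/json") -> str:
    """Queue one async inference request and return the S3 URI of its result."""
    if not input_s3_uri.startswith("s3://"):
        raise ValueError("input_s3_uri must be an S3 URI such as s3://bucket/key")
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,  # the payload is read from S3, not sent inline
        ContentType=content_type,
    )
    # The call returns immediately; the object appears at this location later.
    return response["OutputLocation"]
```

With real credentials this would be called as `submit_async_request(boto3.client("sagemaker-runtime"), "my-async-endpoint", "s3://bucket/async-input/request.json")`, where the endpoint name and input key are placeholders. Completion can then be detected through the SNS topics configured above or by polling the returned location.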

Multi-Model Endpoints

Host multiple models on a single endpoint.

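from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import Predictor

mme = MultiDataModel(
    name="multi-model-endpoint",
    model_data_prefix="s3://bucket/models/",  # artifacts under this prefix can be served
    image_uri=inference_image,
    role=role,
    predictor_cls=Predictor,  # without this, deploy() returns None
)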

predictor = mme.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large"
)

# Invoke specific model
predictor.predict(data, target_model="model-a.tar.gz")
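
The `target_model` value is the artifact's path relative to `model_data_prefix`, not a full S3 URI. The helper below is a small sketch of that mapping; the name `target_model_for` is hypothetical and not part of the SageMaker SDK.

```python
def target_model_for(model_data_prefix: str, artifact_uri: str) -> str:
    """Return the path of artifact_uri relative to the endpoint's S3 prefix."""
    # Normalize the prefix so "s3://bucket/models" and "s3://bucket/models/" match.
    prefix = model_data_prefix if model_data_prefix.endswith("/") else model_data_prefix + "/"
    if not artifact_uri.startswith(prefix):
        raise ValueError(f"{artifact_uri} is not under {prefix}")
    relative = artifact_uri[len(prefix):]
    if not relative:
        raise ValueError("artifact_uri must point at an object, not the prefix itself")
    return relative


key = target_model_for("s3://bucket/models/", "s3://bucket/models/model-a.tar.gz")
```

Here `key` can be passed as `target_model` in the `predict` call above. Artifacts in sub-folders keep their relative path, for example `team/model-b.tar.gz`.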

Inference Components

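Inference components let several models share the instances behind one endpoint, each with its own resource reservation and number of copies.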

  • Copy-based scaling
  • Fine-grained resource allocation
  • Cost optimization for multiple models
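
To make the resource allocation and copy count concrete, the sketch below builds a request body for the boto3 `create_inference_component` call. The field names follow the CreateInferenceComponent API as I understand it and should be checked against the current reference. The helper name and its default values are hypothetical.

```python
def build_inference_component_request(component_name: str, endpoint_name: str,
                                      model_name: str, cpu_cores: float,
                                      min_memory_mb: int, copy_count: int = 1,
                                      accelerators: float = 0,
                                      variant_name: str = "AllTraffic") -> dict:
    """Assemble keyword arguments for create_inference_component (assumed field names)."""
    if copy_count < 0:
        raise ValueError("copy_count must not be negative")
    # Fine-grained allocation: each copy reserves this much of an instance.
    compute = {
        "NumberOfCpuCoresRequired": cpu_cores,
        "MinMemoryRequiredInMb": min_memory_mb,
    }
    if accelerators:
        compute["NumberOfAcceleratorDevicesRequired"] = accelerators
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": compute,
        },
        # Copy-based scaling: the component scales by changing this count.
        "RuntimeConfig": {"CopyCount": copy_count},
    }


request = build_inference_component_request(
    "model-a-component", "shared-endpoint", "model-a",
    cpu_cores=2, min_memory_mb=4096, copy_count=2,
)
```

With credentials in place, the dictionary would be passed as `boto3.client("sagemaker").create_inference_component(**request)`. All names in the example call are placeholders.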

Endpoint Comparison

| Feature | Real-time | Serverless | Async |
|---|---|---|---|
| Cold start | No | Yes | No |
| Max payload | 6 MB | 4 MB | 1 GB |
| Max timeout | 60 s | 60 s | 1 hour |
| Scaling | Manual/Auto | Automatic | Manual/Auto |
| Billing | Per hour | Per request | Per hour |

Exam Tips

!!! warning "Endpoint Selection"
    - Real-time: Consistent traffic, low latency required
    - Serverless: Variable traffic, cost sensitive
    - Async: Large payloads, long processing
    - Multi-model: Many similar models, cost optimization
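
The selection rules can be written as a small decision function, which may help when working through scenario questions. This is a simplified sketch rather than an official decision procedure. The `Workload` fields and the rule order are assumptions, and the 6 MB and 60 s thresholds are the real-time limits from the comparison table.

```python
from dataclasses import dataclass

REALTIME_MAX_PAYLOAD_MB = 6   # real-time payload limit from the comparison table
REALTIME_MAX_SECONDS = 60     # real-time timeout from the comparison table


@dataclass
class Workload:
    payload_mb: float
    processing_seconds: float
    steady_traffic: bool
    many_similar_models: bool = False


def suggest_endpoint_type(w: Workload) -> str:
    """Map a workload description to one of the endpoint types above."""
    # Anything a synchronous call cannot carry or wait for goes to async.
    if w.payload_mb > REALTIME_MAX_PAYLOAD_MB or w.processing_seconds > REALTIME_MAX_SECONDS:
        return "async"
    # Many similar models share one fleet to cut cost.
    if w.many_similar_models:
        return "multi-model"
    # Consistent traffic justifies always-on instances; otherwise pay per request.
    return "real-time" if w.steady_traffic else "serverless"


choice = suggest_endpoint_type(
    Workload(payload_mb=200, processing_seconds=300, steady_traffic=False)
)
```

In this example `choice` is `"async"` because the payload exceeds what a synchronous request can carry.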