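SageMaker Endpoints
Overview
SageMaker endpoints host deployed models for inference. The endpoint type to choose depends on latency requirements, payload size, processing time, and traffic pattern.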
Endpoint Types
Real-time Inference
Always-on endpoints for low-latency predictions.
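from sagemaker.model import Model
from sagemaker.predictor import Predictor

# inference_image, role, and data are assumed to be defined earlier.
model = Model(
    image_uri=inference_image,
    model_data="s3://bucket/model.tar.gz",
    role=role,
    # Without predictor_cls, Model.deploy() returns None instead of a predictor.
    predictor_cls=Predictor,
)

predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.m5.large",
    endpoint_name="my-endpoint",
)

# Invoke
response = predictor.predict(data)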
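Outside the SageMaker Python SDK, a real-time endpoint can also be called through the `sagemaker-runtime` API. The following is a minimal sketch, assuming a container that accepts and returns JSON. The helper name `invoke_json` and the injected `runtime_client` parameter are illustrative, and the size guard uses the 6 MB real-time figure from the comparison table in this document as an assumption.

```python
import json

# Assumption: real-time payload limit as listed in the comparison table.
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024


def invoke_json(runtime_client, endpoint_name, payload, max_bytes=MAX_PAYLOAD_BYTES):
    """Send a JSON payload to a real-time endpoint and decode the JSON reply.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    """
    body = json.dumps(payload).encode("utf-8")
    if len(body) > max_bytes:
        raise ValueError(f"payload is {len(body)} bytes; limit is {max_bytes}")
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=body,
    )
    # The response body is a stream; read it once and decode it.
    return json.loads(response["Body"].read())
```

With boto3 installed and credentials configured, a call might look like `invoke_json(boto3.client("sagemaker-runtime"), "my-endpoint", {"instances": [[1.0, 2.0]]})`. The payload shape depends entirely on the serving container.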
Serverless Inference
Pay-per-request endpoints that scale automatically with traffic and scale to zero when idle, at the cost of cold starts.
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10,
)

# No instance type or count: capacity is provisioned per request.
predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="serverless-endpoint",
)
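Serverless settings can be checked before calling `deploy`. The sketch below assumes memory sizes are offered in 1 GB steps from 1024 MB to 6144 MB. Check the current SageMaker documentation before relying on that list. `validate_serverless_config` is a hypothetical helper, not part of the SDK, and it does not check the account-level concurrency quota.

```python
# Assumption: serverless memory sizes come in 1 GB steps from 1 GB to 6 GB.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def validate_serverless_config(memory_size_in_mb, max_concurrency):
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if memory_size_in_mb not in ALLOWED_MEMORY_MB:
        problems.append(
            f"memory_size_in_mb={memory_size_in_mb} is not one of {ALLOWED_MEMORY_MB}"
        )
    if max_concurrency < 1:
        problems.append("max_concurrency must be at least 1")
    return problems
```

Returning a list rather than raising on the first issue lets a caller report every problem at once.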
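Asynchronous Inference
For large payloads and long processing times. Requests are queued, the input is read from S3, and the result is written to the configured S3 output path.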
from sagemaker.async_inference import AsyncInferenceConfig

# success_topic_arn and error_topic_arn are SNS topic ARNs defined elsewhere.
async_config = AsyncInferenceConfig(
    output_path="s3://bucket/async-output/",
    max_concurrent_invocations_per_instance=4,
    notification_config={
        "SuccessTopic": success_topic_arn,
        "ErrorTopic": error_topic_arn,
    },
)

predictor = model.deploy(
    instance_type="ml.m5.large",
    initial_instance_count=1,
    async_inference_config=async_config,
)
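Invoking an asynchronous endpoint differs from the real-time case: the request points at an object in S3 and returns immediately with the location where the result will appear. A minimal sketch using the `sagemaker-runtime` `invoke_endpoint_async` call follows. `submit_async` and the injected `runtime_client` are illustrative names.

```python
def submit_async(runtime_client, endpoint_name, input_s3_uri,
                 content_type="application/json"):
    """Queue one asynchronous request and return (inference_id, output_location).

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    """
    if not input_s3_uri.startswith("s3://"):
        raise ValueError("asynchronous inference reads its input from an S3 URI")
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,
        ContentType=content_type,
    )
    # The call returns right away; the prediction is written to OutputLocation later.
    return response["InferenceId"], response["OutputLocation"]
```

The caller can then poll the returned S3 location or subscribe to the SNS topics set in `notification_config`.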
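Multi-Model Endpoints
Host multiple models on a single endpoint. Model artifacts stored under a shared S3 prefix are loaded on demand and share the endpoint's container.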
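from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import Predictor

mme = MultiDataModel(
    name="multi-model-endpoint",
    model_data_prefix="s3://bucket/models/",
    image_uri=inference_image,
    role=role,
    # Passed through to the underlying Model so that deploy() returns a predictor.
    predictor_cls=Predictor,
)

predictor = mme.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)

# Invoke a specific model; target_model is relative to model_data_prefix.
predictor.predict(data, target_model="model-a.tar.gz")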
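The `target_model` value is the artifact's path relative to the endpoint's S3 prefix, not a full S3 URI. The sketch below derives it from a full URI. `target_model_for` is a hypothetical helper written for illustration.

```python
def target_model_for(model_s3_uri, model_data_prefix):
    """Return the path of a model artifact relative to the endpoint's S3 prefix."""
    # Normalise the prefix so "s3://bucket/models" and "s3://bucket/models/" match.
    prefix = model_data_prefix if model_data_prefix.endswith("/") else model_data_prefix + "/"
    if not model_s3_uri.startswith(prefix):
        raise ValueError(f"{model_s3_uri} is not under {prefix}")
    return model_s3_uri[len(prefix):]
```

For example, `target_model_for("s3://bucket/models/model-a.tar.gz", "s3://bucket/models/")` yields the `"model-a.tar.gz"` value used in the call above.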
Inference Components
Share an endpoint's compute resources across models, with each model assigned its own resource allocation.
- Copy-based scaling
- Fine-grained resource allocation
- Cost optimization for multiple models
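The points above map onto the fields of the SageMaker `CreateInferenceComponent` API: per-model compute requirements give the fine-grained allocation, and a copy count drives scaling. The sketch below only builds the request dictionary. The helper name, the `-component` naming scheme, and the `AllTraffic` variant name are assumptions for illustration.

```python
def inference_component_request(endpoint_name, model_name, cpu_cores, min_memory_mb,
                                copy_count=1, variant_name="AllTraffic"):
    """Build keyword arguments for the SageMaker CreateInferenceComponent API."""
    return {
        # Hypothetical naming scheme: one component per model.
        "InferenceComponentName": f"{model_name}-component",
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            # Resources reserved for each copy of this model.
            "ComputeResourceRequirements": {
                "NumberOfCpuCoresRequired": cpu_cores,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        # Copy-based scaling: the number of copies of the model to run.
        "RuntimeConfig": {"CopyCount": copy_count},
    }
```

The result could be passed to a boto3 SageMaker client as `client.create_inference_component(**request)`, assuming the endpoint and model already exist.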
Endpoint Comparison
| Feature | Real-time | Serverless | Async |
|---|---|---|---|
| Cold Start | No | Yes | No |
| Max Payload | 6 MB | 4 MB | 1 GB |
| Max Timeout | 60 s | 60 s | 60 min |
| Scaling | Manual/Auto | Automatic | Manual/Auto |
| Billing | Per instance-hour | Per request | Per instance-hour |
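The "Auto" scaling option for instance-backed endpoints is configured through Application Auto Scaling. The sketch below builds the two requests for a target-tracking policy on invocations per instance. The helper name, policy name, and default variant name are assumptions, and the code only constructs dictionaries without calling AWS.

```python
def target_tracking_requests(endpoint_name, min_instances, max_instances,
                             invocations_per_instance, variant_name="AllTraffic"):
    """Build arguments for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Bounds on how far the variant may scale in or out.
    register = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    # Keep average invocations per instance near the target value.
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations-target",  # hypothetical name
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return register, policy
```

With a boto3 `application-autoscaling` client, the two dictionaries would be applied as `client.register_scalable_target(**register)` followed by `client.put_scaling_policy(**policy)`.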
Exam Tips
!!! warning "Endpoint Selection"

    - Real-time: Consistent traffic, low latency required
    - Serverless: Variable traffic, cost sensitive
    - Async: Large payloads, long processing
    - Multi-model: Many similar models, cost optimization
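The selection guidance above can be written down as a small decision function. This is a sketch for study purposes only: the function name, the boolean inputs, and especially the order in which the rules are checked are assumptions, not an official AWS decision procedure.

```python
def suggest_endpoint_type(large_payload=False, long_processing=False,
                          variable_traffic=False, many_similar_models=False):
    """Map workload traits to the endpoint type suggested by the exam tips."""
    # Hard constraints first: only async handles large payloads or long jobs.
    if large_payload or long_processing:
        return "async"
    # Many similar models are cheaper to host together.
    if many_similar_models:
        return "multi-model"
    # Spiky or intermittent traffic favours pay-per-request.
    if variable_traffic:
        return "serverless"
    # Default: steady traffic with low-latency requirements.
    return "real-time"
```

Checking payload and processing time first reflects that those are limits a workload cannot work around, whereas traffic pattern and model count are cost trade-offs.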