# Lab 04: Endpoint Deployment

**Domain:** 3 - Deployment & Orchestration
**Difficulty:** Medium
**Time:** 45 minutes
## Objective
Deploy models using different SageMaker endpoint types.
## Prerequisites
- Trained model in S3 (from Lab 02)
- SageMaker execution role
## Steps

### Step 1: Real-time Endpoint
```python
import numpy as np
from sagemaker.model import Model
from sagemaker.serializers import CSVSerializer

# Create a Model object from the training artifact produced in Lab 02
model = Model(
    image_uri=xgb_image,
    model_data=model_artifact,
    role=role,
    sagemaker_session=sagemaker_session
)

# Deploy a real-time endpoint (provisioned instance, always on)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name='xgb-realtime-endpoint',
    serializer=CSVSerializer()  # built-in XGBoost container expects CSV input
)

# Test prediction
test_data = np.array([[0.5, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.8]])
result = predictor.predict(test_data)
print(f"Prediction: {result}")
```
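The built-in XGBoost container accepts `text/csv` request bodies: one comma-separated row per sample, no header. As a rough illustration, here is a hand-rolled stand-in for the SDK's `CSVSerializer` (for clarity only; use the SDK class in practice) showing what the payload for `test_data` looks like on the wire:

```python
import numpy as np

def to_csv_payload(rows):
    # One comma-separated line per sample, no header -- the shape the
    # built-in XGBoost container expects for text/csv requests
    return "\n".join(",".join(str(float(v)) for v in row) for row in rows)

test_data = np.array([[0.5, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.8]])
payload = to_csv_payload(test_data)
print(payload)  # 0.5,0.3,0.2,0.4,0.1,0.6,0.7,0.8
```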
### Step 2: Serverless Endpoint
```python
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=5
)

# Deploy a serverless endpoint (no instance type or count needed)
serverless_predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name='xgb-serverless-endpoint'
)

# Test (the first invocation may incur a cold-start delay)
result = serverless_predictor.predict(test_data)
print(f"Serverless prediction: {result}")
```
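To see the cold start in practice, time two consecutive invocations. A small timing helper (nothing SageMaker-specific; the commented-out calls assume the `serverless_predictor` and `test_data` from above):

```python
import time

def timed_call(fn, *args):
    # Returns (result, elapsed_seconds) for a single invocation
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# The first call after an idle period typically includes the cold start;
# the second should be warm and noticeably faster:
# _, cold = timed_call(serverless_predictor.predict, test_data)
# _, warm = timed_call(serverless_predictor.predict, test_data)
```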
### Step 3: Async Endpoint
```python
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path=f's3://{bucket}/{prefix}/async-output/',
    max_concurrent_invocations_per_instance=4
)

# Deploy an asynchronous endpoint (requests and responses go through S3)
async_predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    async_inference_config=async_config,
    endpoint_name='xgb-async-endpoint'
)
```
### Step 4: Invoke the Async Endpoint
```python
import json
import boto3

# Upload the input payload to S3
input_data = json.dumps(test_data.tolist())
input_key = f'{prefix}/async-input/input.json'
s3_client = boto3.client('s3')
s3_client.put_object(Bucket=bucket, Key=input_key, Body=input_data)

# Invoke asynchronously -- the call returns immediately with an output location
sm_runtime = boto3.client('sagemaker-runtime')
response = sm_runtime.invoke_endpoint_async(
    EndpointName='xgb-async-endpoint',
    InputLocation=f's3://{bucket}/{input_key}',
    ContentType='application/json'
)
output_location = response['OutputLocation']
print(f"Output will be at: {output_location}")
```
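`invoke_endpoint_async` returns before the result exists, so you need to wait for the output object to appear. A minimal polling sketch; `fetch` is injected rather than hard-coded so the loop is easy to exercise without live AWS calls (with boto3, it would wrap `s3_client.get_object` and translate a missing-key error into `FileNotFoundError`):

```python
import time
from urllib.parse import urlparse

def wait_for_async_result(fetch, output_location, timeout=300, interval=5):
    """Poll until the async output object appears, then return its body.

    `fetch(bucket, key)` should return the object body as a string, or
    raise FileNotFoundError while the result is not yet available.
    """
    parsed = urlparse(output_location)  # e.g. s3://bucket/path/to/output
    bucket, key = parsed.netloc, parsed.path.lstrip('/')
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return fetch(bucket, key)
        except FileNotFoundError:
            time.sleep(interval)
    raise TimeoutError(f"No result at {output_location} after {timeout}s")
```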
### Step 5: Compare Endpoints
| Endpoint Type | Latency | Cost | Use Case |
|---|---|---|---|
| Real-time | Low | High (always-on) | Interactive apps |
| Serverless | Variable | Low (pay-per-use) | Variable traffic |
| Async | High | Medium | Large payloads |
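The table's trade-offs can be boiled down to a rule of thumb. The thresholds below are illustrative, not official limits (check the current SageMaker service quotas for exact payload caps):

```python
def choose_endpoint_type(payload_mb, latency_sensitive, traffic_is_spiky):
    # Rule-of-thumb selector mirroring the comparison table above;
    # the 6 MB figure approximates the real-time payload limit
    if payload_mb > 6:
        return 'async'          # large payloads go through S3
    if latency_sensitive:
        return 'real-time'      # always-on instances, consistent latency
    if traffic_is_spiky:
        return 'serverless'     # scales to zero, pay per use
    return 'real-time'

print(choose_endpoint_type(100, False, False))  # async
```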
## Verification
- All three endpoints deployed successfully
- Predictions returned correctly
- Understand trade-offs between endpoint types
## Cleanup

```python
import boto3
from botocore.exceptions import ClientError

sm_client = boto3.client('sagemaker')

# Delete all three endpoints and their endpoint configs
for endpoint in ['xgb-realtime-endpoint', 'xgb-serverless-endpoint', 'xgb-async-endpoint']:
    try:
        sm_client.delete_endpoint(EndpointName=endpoint)
        sm_client.delete_endpoint_config(EndpointConfigName=endpoint)
    except ClientError as err:
        print(f"Skipping {endpoint}: {err}")

# Delete the model
sm_client.delete_model(ModelName=model.name)
```
## Key Takeaways

!!! note "Exam Points"
    - Real-time: consistent, low-latency needs
    - Serverless: cost optimization, variable traffic
    - Async: large payloads, long processing times
    - Always clean up endpoints to avoid charges