Data Storage

Overview

Choosing the right storage solutions for ML data.

Amazon S3

Primary data lake storage for ML.

Storage Classes

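| Class | Use Case | Retrieval |
|-------|----------|-----------|
| Standard | Frequent access | Immediate |
| Intelligent-Tiering | Unknown access patterns | Immediate for the default access tiers; tiering is automatic |
| Standard-IA | Infrequent access | Immediate |
| Glacier | Archive | Minutes to hours |
| Glacier Deep Archive | Long-term archive | Hours (within 12 standard, up to 48 bulk) |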
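
The storage class is chosen per object at upload time. The sketch below maps the class names in the table to the `StorageClass` values the S3 API accepts and builds the arguments for `put_object`. The helper names (`put_object_args`, `upload`) and the bucket and key in the usage line are hypothetical, chosen only for illustration.

```python
# Class names from the table mapped to the values the S3 API expects.
STORAGE_CLASSES = {
    "Standard": "STANDARD",
    "Intelligent-Tiering": "INTELLIGENT_TIERING",
    "Standard-IA": "STANDARD_IA",
    "Glacier": "GLACIER",
    "Glacier Deep Archive": "DEEP_ARCHIVE",
}


def put_object_args(bucket, key, body, class_name="Standard"):
    """Build keyword arguments for S3 put_object with an explicit storage class."""
    if class_name not in STORAGE_CLASSES:
        raise ValueError(f"unknown storage class: {class_name}")
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "StorageClass": STORAGE_CLASSES[class_name],
    }


def upload(bucket, key, body, class_name="Standard"):
    """Upload one object. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    return s3.put_object(**put_object_args(bucket, key, body, class_name))
```

For example, `upload("my-ml-bucket", "raw/2023/events.csv", data, "Standard-IA")` would store rarely read raw data in Standard-IA from the start.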

S3 for ML Best Practices

  • Use Parquet/ORC for columnar analytics
  • Enable versioning for data lineage
  • Use S3 Select to filter at the source (AWS closed S3 Select to new customers in 2024; Athena is the usual alternative)
  • Configure lifecycle policies for cost optimization
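
As a rough illustration of the lifecycle bullet, the sketch below builds a lifecycle configuration that moves objects under a prefix through Standard-IA, Glacier, and Glacier Deep Archive. The day thresholds, the rule ID format, and the function names are assumptions for this example, not recommended values.

```python
def lifecycle_config(prefix, ia_days=30, glacier_days=90, deep_archive_days=365):
    """Build a lifecycle configuration that tiers objects down as they age."""
    if not (ia_days < glacier_days < deep_archive_days):
        raise ValueError("transition days must be strictly increasing")
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [
            {
                "ID": f"tier-down-{rule_name}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                    {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Attach the configuration to a bucket. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so lifecycle_config works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

Calling `apply_lifecycle("my-ml-bucket", lifecycle_config("training-data/"))` would apply the rule to a hypothetical bucket. Note that this call replaces any lifecycle configuration already on the bucket.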

AWS Lake Formation

Centralized data lake management.

Key Features

  • Fine-grained access control
  • Data sharing across accounts
  • Built-in data catalog integration
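  • Row/column-level security

```python
import boto3

# Lake Formation client; region and credentials come from the environment
lake_formation = boto3.client('lakeformation')

# Grant SELECT on one table to an IAM role (placeholder ARN, database, and table)
lake_formation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::account:role/role'},
    Resource={'Table': {'DatabaseName': 'db', 'Name': 'table'}},
    Permissions=['SELECT']
)
```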
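
The grant above covers a whole table. For the column-level security listed under Key Features, `grant_permissions` also accepts a `TableWithColumns` resource. The sketch below builds such a request. The helper names and the column list in the usage line are hypothetical, and row-level filtering (data cells filters) is not covered here.

```python
def column_grant_args(role_arn, database, table, columns):
    """Build grant_permissions arguments that expose only the named columns."""
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant(args):
    """Send the grant to Lake Formation. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so column_grant_args works without boto3

    boto3.client("lakeformation").grant_permissions(**args)
```

For example, `grant(column_grant_args(role_arn, "db", "table", ["customer_id", "order_date"]))` would let a data scientist's role read two columns while the rest of the table stays hidden.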

Amazon Athena

Serverless SQL queries on S3 data.

```sql
-- Query data directly in S3
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE order_date >= DATE '2024-01-01'  -- assumes order_date is a DATE column
GROUP BY customer_id;
```

Athena for ML

  • Query training data without loading into memory
  • Create datasets from complex joins
  • Partition data for efficient queries
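
Partitioning usually means laying out S3 keys in Hive style (`column=value` path segments) so that Athena can skip whole prefixes when a query filters on the partition columns. A minimal sketch, assuming a year/month/day scheme and a hypothetical `orders` table:

```python
from datetime import date


def partition_key(table, day, filename):
    """Build a Hive-style S3 key partitioned by year, month, and day."""
    return (
        f"{table}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


key = partition_key("orders", date(2024, 1, 15), "part-0000.parquet")
print(key)  # orders/year=2024/month=01/day=15/part-0000.parquet
```

A query with `WHERE year = '2024' AND month = '01'` then reads only the matching prefixes. New partitions still have to be registered in the catalog, for example with `MSCK REPAIR TABLE`, before Athena sees them.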

Data Formats Comparison

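| Format | Best For | Compression | Schema |
|--------|----------|-------------|--------|
| Parquet | Analytics, columnar | High | Embedded |
| CSV | Simple data | Low | External |
| JSON | Semi-structured | Medium | Flexible |
| Avro | Streaming | Medium | Embedded |
| RecordIO | SageMaker | Medium | External |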
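
The Schema column matters in practice. The standard-library sketch below round-trips the same records through CSV and JSON Lines: CSV returns every value as a string, so types must come from an external schema, while JSON keeps numbers as numbers. The sample records are made up for illustration.

```python
import csv
import io
import json

records = [
    {"customer_id": 1, "total": 19.99},
    {"customer_id": 2, "total": 5.0},
]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def to_jsonl(rows):
    """Serialize rows to JSON Lines, one object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def from_jsonl(text):
    """Parse JSON Lines; numeric types are preserved."""
    return [json.loads(line) for line in text.splitlines() if line]


csv_rows = from_csv(to_csv(records))
json_rows = from_jsonl(to_jsonl(records))
print(csv_rows[0])   # {'customer_id': '1', 'total': '19.99'}
print(json_rows[0])  # {'customer_id': 1, 'total': 19.99}
```

Parquet and Avro go one step further by storing the schema alongside the data, which is why the table lists their schema as embedded.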

Exam Tips

!!! warning "Storage Decisions"
    - S3 + Athena for ad-hoc queries
    - Lake Formation for governed data access
    - Parquet for most ML workloads
    - RecordIO for large-scale SageMaker training