Domain 1: Data Preparation for Machine Learning
Weight: 28% of scored content
This domain covers the foundational skills needed to prepare data for machine learning workloads on AWS.
Topics Covered
| Topic | Description |
|---|---|
| Data Ingestion | S3, Kinesis, Glue, DataSync |
| Data Transformation | Glue ETL, DataBrew, EMR, Spark |
| Data Validation | Glue Data Quality, data integrity |
| Feature Engineering | SageMaker Feature Store, Processing |
| Data Storage | Lake Formation, Athena, data formats |
Key Concepts
Data Ingestion Patterns
graph LR
A[Data Sources] --> B{Ingestion Method}
B --> C[Batch: S3, Glue]
B --> D[Streaming: Kinesis, Firehose]
B --> E[On-premises: DataSync]
C --> F[Data Lake / S3]
D --> F
E --> F
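The routing in the diagram above can be sketched as a small lookup table. This is only an illustration: `INGESTION_ROUTES`, `DESTINATION`, and `choose_ingestion` are hypothetical names, not part of any AWS SDK, and the category labels simply mirror the diagram.

```python
# Hypothetical lookup mirroring the ingestion diagram; not an AWS API.
INGESTION_ROUTES = {
    "batch": ["S3", "Glue"],
    "streaming": ["Kinesis", "Firehose"],
    "on-premises": ["DataSync"],
}

# Every route in the diagram lands in the same destination.
DESTINATION = "Data Lake / S3"


def choose_ingestion(source_kind: str) -> dict:
    """Return candidate services and the destination for a source kind."""
    key = source_kind.strip().lower()
    if key not in INGESTION_ROUTES:
        raise ValueError(f"unknown source kind: {source_kind!r}")
    return {"services": INGESTION_ROUTES[key], "destination": DESTINATION}


if __name__ == "__main__":
    print(choose_ingestion("Streaming"))
```

In practice the choice also depends on latency, volume, and cost requirements, which this lookup deliberately ignores.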
Common Data Formats for ML
| Format | Use Case | Pros |
|---|---|---|
| Parquet | Columnar analytics | Compression, fast reads |
| CSV | Simple tabular data | Universal compatibility |
| JSON/JSONL | Semi-structured data | Flexibility |
| RecordIO (protobuf) | SageMaker built-in algorithm training | Optimized for streaming (Pipe mode) |
| TFRecord | TensorFlow training | TensorFlow native |
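To make the CSV and JSONL rows concrete, here is a minimal sketch that converts CSV text to JSON Lines using only the Python standard library. The `csv_to_jsonl` helper and the sample data are made up for illustration. Converting to Parquet would need a third-party library and is not shown.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text with a header row into JSON Lines text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One JSON object per line; values stay strings because CSV has no types.
    return "\n".join(json.dumps(row) for row in reader)


sample = "id,label\n1,cat\n2,dog\n"  # illustrative data only
print(csv_to_jsonl(sample))
```

Every value comes out as a string, which shows one limitation of CSV: it carries no type information, so types must be restored downstream.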
Study Checklist
- Understand S3 storage classes and lifecycle policies
- Know Glue components: Crawlers, Data Catalog, Jobs
- Understand Kinesis Data Streams vs Data Firehose
- Know SageMaker Feature Store online store (low-latency lookups for inference) vs offline store (historical data in S3 for training)
- Understand data validation and quality checks
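As a study aid for the last checklist item, the sketch below shows two common quality checks, completeness and uniqueness, in plain Python. It does not use the Glue Data Quality API. The function names and the sample records are assumptions made up for illustration.

```python
def completeness(rows, column):
    """Fraction of rows where `column` is present and not empty or None."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


def is_unique(rows, column):
    """True if no non-missing value in `column` appears more than once."""
    values = [r[column] for r in rows if r.get(column) not in (None, "")]
    return len(values) == len(set(values))


# Illustrative records: one missing age and one duplicated id.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 51},
    {"id": 3, "age": 29},
]

print(completeness(records, "age"))  # 0.75
print(is_unique(records, "id"))      # False
```

A pipeline would typically compare results like these against thresholds and stop or quarantine the data when a check fails.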