Data Ingestion

Overview

Data ingestion is the process of moving data from various sources into AWS for ML workloads.

Key Services

Amazon S3

Primary storage for ML data. Key concepts:

  • Storage classes: Standard, Intelligent-Tiering, Glacier
  • Versioning for data lineage
  • S3 Select for querying data in place
  • Transfer Acceleration for faster uploads
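
As a rough illustration of how versioning can support data lineage, the sketch below uploads an object and returns the version ID that S3 assigns to it. It assumes a boto3 S3 client is passed in and that versioning is already enabled on the bucket; the function name and the bucket and key values are hypothetical.

```python
from typing import Any, Optional


def upload_and_record_version(
    s3_client: Any, bucket: str, key: str, body: bytes
) -> Optional[str]:
    """Upload an object and return its S3 version ID, if one was assigned.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    response = s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    # "VersionId" is expected in the response when bucket versioning is enabled;
    # .get() returns None if the field is absent.
    return response.get("VersionId")
```

Storing the returned version ID next to a training run's metadata is one way to pin down exactly which copy of a dataset a model was trained on.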

Amazon Kinesis

Real-time streaming data ingestion.

| Service | Use Case |
|---|---|
| Kinesis Data Streams | Custom real-time processing |
| Kinesis Data Firehose | Managed delivery to S3/Redshift |
| Kinesis Data Analytics | SQL on streaming data |
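
For the custom-processing case, a producer typically writes to Kinesis Data Streams in batches. The sketch below is one possible shape for that, assuming a boto3 Kinesis client is passed in; the `send_events` helper, the `device_id` partition key field, and the event layout are hypothetical. It respects the 500-records-per-request limit of `PutRecords` but does not check the per-request byte-size limit or retry failed records.

```python
import json
from typing import Any, Dict, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def chunked(items: List[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_events(
    kinesis_client: Any,
    stream_name: str,
    events: List[Dict[str, Any]],
    key_field: str = "device_id",  # hypothetical field used as the partition key
) -> int:
    """Send events in batches and return how many records Kinesis reported as failed."""
    records = [
        {"Data": json.dumps(event).encode("utf-8"), "PartitionKey": str(event[key_field])}
        for event in events
    ]
    failed = 0
    for batch in chunked(records, MAX_RECORDS_PER_CALL):
        response = kinesis_client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A non-zero return value means some records were rejected (for example, due to throttling), so a production producer would resend those records rather than only counting them.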

AWS Glue

Serverless ETL and data cataloging.

  • Crawlers: Automatically discover schema
  • Data Catalog: Centralized metadata repository
  • ETL Jobs: Transform data with Spark
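
To make the crawler step concrete, the following sketch starts a crawler and polls until it returns to the `READY` state. It assumes a boto3 Glue client is passed in and that the crawler already exists; the function name, polling interval, and poll limit are illustrative choices, not recommended values.

```python
import time
from typing import Any


def run_crawler_and_wait(
    glue_client: Any,
    crawler_name: str,
    poll_seconds: float = 15.0,
    max_polls: int = 120,
) -> bool:
    """Start a Glue crawler and poll until it is READY again.

    Returns True if the crawler reached READY within `max_polls` checks,
    False otherwise. `glue_client` is assumed to be boto3.client("glue").
    """
    glue_client.start_crawler(Name=crawler_name)
    for _ in range(max_polls):
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            return True
        time.sleep(poll_seconds)
    return False
```

Reaching `READY` only indicates that the run has ended, not that it succeeded; the crawler's last-run status would still need to be checked before relying on the updated Data Catalog tables.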

AWS DataSync

Automated data transfer from on-premises storage (for example NFS or SMB file shares) to AWS storage services such as S3 and EFS.

Best Practices

Data Organization

Use a consistent folder structure in S3:

s3://bucket/
├── raw/        # Original data
├── processed/  # Cleaned data
├── features/   # Feature store data
└── models/     # Trained models
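
One way to keep this layout consistent is to build object keys through a single helper rather than by hand. The sketch below is a minimal example; the `build_key` function, the per-dataset subfolder, and the optional `dt=YYYY-MM-DD` date partition are assumptions added for illustration and are not part of the layout above.

```python
from datetime import date
from typing import Optional

# Top-level prefixes from the folder structure above.
STAGES = ("raw", "processed", "features", "models")


def build_key(stage: str, dataset: str, filename: str, day: Optional[date] = None) -> str:
    """Return an S3 object key of the form <stage>/<dataset>/[dt=<day>/]<filename>."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    parts = [stage, dataset]
    if day is not None:
        # Hypothetical Hive-style date partition, e.g. dt=2024-01-15.
        parts.append(f"dt={day.isoformat()}")
    parts.append(filename)
    return "/".join(parts)
```

For example, `build_key("raw", "clicks", "part-0.csv", date(2024, 1, 15))` yields `raw/clicks/dt=2024-01-15/part-0.csv`, while an unrecognized stage such as `"tmp"` raises a `ValueError` instead of silently creating a new top-level prefix.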

Exam Tips

!!! warning "Common Exam Scenarios"
    - Streaming data → Kinesis Data Firehose to S3
    - On-premises large datasets → DataSync
    - Schema discovery → Glue Crawlers
    - Cross-region replication → S3 Replication