Amazon Kinesis is a real-time data streaming platform.
| Feature | Data Streams | Data Firehose |
|---|---|---|
| Latency | Real-time (~200 ms) | Near real-time (buffered, typically 60 s or more) |
| Scaling | Provisioned shards (manual) or on-demand mode | Automatic |
| Consumers | Custom | S3, Redshift, OpenSearch, Splunk |
| Data Retention | 1-365 days | No retention |
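
Shard-based scaling in Data Streams comes down to a capacity estimate. The sketch below assumes the commonly documented per-shard write limits for provisioned mode (1 MB/s and 1,000 records/s) and treats 1 MB as 10^6 bytes; the function name and sample workloads are hypothetical, so verify the figures against current quotas before sizing a real stream.

```python
import math

# Assumed per-shard write limits for provisioned mode; verify against current quotas.
SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # 1 MB/s, treated here as 10^6 bytes
SHARD_WRITE_RECORDS_PER_SEC = 1_000     # 1,000 records/s

def estimate_shards(records_per_sec: float, avg_record_bytes: float) -> int:
    """Return the minimum shard count that satisfies both write limits."""
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / SHARD_WRITE_BYTES_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_bytes, by_records)

# Hypothetical workloads: one bound by record count, one bound by throughput.
small_records = estimate_shards(records_per_sec=5000, avg_record_bytes=500)
large_records = estimate_shards(records_per_sec=1000, avg_record_bytes=4000)
print(small_records, large_records)
```

Firehose needs no equivalent calculation, which is the "Automatic" entry in the Scaling row.
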
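AWS Glue is a serverless ETL (extract, transform, load) service.
Amazon Athena is a serverless interactive query service for data stored in S3.
AWS Lake Formation is a service for building, securing, and managing data lakes on S3.
Amazon QuickSight is a serverless BI (business intelligence) and visualization service.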
| Format | Type | Best For |
|---|---|---|
| CSV | Row-based | Simple data exchange |
| JSON | Semi-structured | APIs, logs |
| Parquet | Columnar | Analytics queries (Athena, Redshift) |
| ORC | Columnar | Hive workloads |
| Avro | Row-based | Streaming, schema evolution |
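
The row-based versus columnar distinction in the table can be illustrated with a toy model. This is a plain-Python sketch, not how Parquet or ORC are actually encoded; the records and function names are hypothetical, and real formats add compression and metadata that this ignores.

```python
# Hypothetical clickstream records, used only for illustration.
records = [
    {"user_id": 101, "url": "/home", "latency_ms": 120},
    {"user_id": 102, "url": "/products/42", "latency_ms": 340},
    {"user_id": 103, "url": "/checkout", "latency_ms": 95},
]

def row_bytes_scanned(records):
    """Row-based layout: each record is stored as one line, so reading any
    single field means reading every line in full."""
    lines = [",".join(str(v) for v in r.values()) for r in records]
    return sum(len(line.encode("utf-8")) for line in lines)

def column_bytes_scanned(records, column):
    """Columnar layout: each column is stored contiguously, so a query that
    needs one column reads only that column's values."""
    values = [str(r[column]) for r in records]
    return sum(len(v.encode("utf-8")) for v in values)

row_total = row_bytes_scanned(records)
col_total = column_bytes_scanned(records, "latency_ms")
print(f"row-based scan: {row_total} bytes, columnar scan: {col_total} bytes")
```

A query that aggregates only `latency_ms` touches a small fraction of the bytes in the columnar layout, which is why columnar formats suit analytics engines that bill or bottleneck on data scanned.
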
| # | Question | Answer |
|---|---|---|
| 1 | Kinesis Data Streams vs Firehose? | Streams: real-time, manual shards, custom consumers. Firehose: near real-time, auto-scaling, delivers to S3/Redshift. |
| 2 | What does AWS Glue do? | Serverless ETL with Data Catalog, Crawlers, and Spark-based jobs |
| 3 | What is Athena? | Serverless SQL queries on S3 data. Pay per TB scanned. |
| 4 | Best data format for Athena? | Parquet or ORC (columnar) with compression |
| 5 | What is Lake Formation? | Build and manage data lakes with centralized access control |
| 6 | What is QuickSight SPICE? | In-memory calculation engine for fast dashboard performance |
| 7 | DataSync vs Storage Gateway? | DataSync: scheduled transfers. Storage Gateway: ongoing hybrid access. |
| 8 | When to use Snowball? | Large data migrations (TBs to PBs) when network transfer is too slow |
| 9 | What does a Glue Crawler do? | Automatically discovers schemas and populates the Data Catalog |
| 10 | Firehose buffer interval range? | 0 to 900 seconds (the minimum was 60 seconds before zero buffering was introduced) |
A company needs to ingest real-time clickstream data and deliver it to S3 for analytics. The solution must scale automatically and require minimal management. Which service should they use?
Correct: B
Kinesis Data Firehose scales automatically and delivers data to S3 with minimal management. Data Streams requires shard management in provisioned mode and, in either capacity mode, a separate consumer to write to S3. SQS is a message queue and is not designed for streaming delivery to S3. MSK (managed Kafka) requires more operational work.
Domain: 3 — Design High-Performing Architectures Task: 3.5
A company stores log data in CSV format on S3 and queries it with Athena. Queries are slow and expensive. What should the architect recommend to improve performance and reduce cost?
Correct: B
Converting to Parquet (a columnar format) dramatically reduces the amount of data Athena scans, because a query reads only the columns it references. Since Athena bills per TB scanned, this improves both performance and cost. Partitioning reduces scanned data further by skipping partitions the query does not need. Increasing the query timeout does not improve performance. Moving to RDS changes the architecture unnecessarily. S3 Select helps, but Parquet plus partitioning is more effective.
Domain: 3 — Design High-Performing Architectures Task: 3.5
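
The cost effect of the Parquet and partitioning recommendation can be estimated with simple arithmetic. The sketch below assumes a price of 5 USD per TB scanned and a decimal terabyte; the table size, column count, compression ratio, and partition count are all hypothetical inputs, not measurements.

```python
# Assumed list price in USD per TB scanned; check current regional pricing.
PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 10**12  # decimal terabyte, assumed for simplicity

def scanned_bytes(table_bytes, column_fraction=1.0, compression_ratio=1.0,
                  partition_fraction=1.0):
    """Estimate bytes scanned given the share of columns and partitions read
    and the compression ratio relative to the raw data."""
    return table_bytes * column_fraction * partition_fraction / compression_ratio

def athena_scan_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Convert bytes scanned into an approximate query cost in USD."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

table = 10**12  # hypothetical 1 TB of CSV logs

# CSV: every query reads the whole table.
csv_cost = athena_scan_cost(scanned_bytes(table))
# Parquet: query reads 3 of 30 columns, with a hypothetical 3:1 compression ratio.
parquet_cost = athena_scan_cost(
    scanned_bytes(table, column_fraction=3 / 30, compression_ratio=3.0))
# Parquet plus daily partitions: query reads 1 of 30 days.
partitioned_cost = athena_scan_cost(
    scanned_bytes(table, column_fraction=3 / 30, compression_ratio=3.0,
                  partition_fraction=1 / 30))

print(f"CSV: ${csv_cost:.2f}, Parquet: ${parquet_cost:.4f}, "
      f"Parquet + partitions: ${partitioned_cost:.4f}")
```
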
A company needs to migrate 50 TB of data from on-premises to S3. Their internet connection is 1 Gbps. Which is the fastest transfer method?
Correct: C
At 1 Gbps, transferring 50 TB takes about 4.6 days at full line rate (50 TB × 8 bits per byte ÷ 1 Gbps ≈ 400,000 seconds), and longer if the link is shared with other traffic. Snowball Edge can be shipped and loaded faster for this data volume. DataSync and multipart upload are both limited by the 1 Gbps connection. Direct Connect takes weeks to provision.
Domain: 3 — Design High-Performing Architectures Task: 3.5
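
The network transfer estimate in this explanation can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits/s); the `utilization` parameter is a hypothetical knob for a link that cannot be fully dedicated to the migration.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Estimate days needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8                       # terabytes -> bits
    effective_bps = link_gbps * 1e9 * utilization   # usable bits per second
    seconds = bits / effective_bps
    return seconds / 86_400                         # seconds -> days

full_rate = transfer_days(50, 1)                    # dedicated link
shared = transfer_days(50, 1, utilization=0.8)      # hypothetical 80% usable
print(f"{full_rate:.2f} days at full rate, {shared:.2f} days at 80% utilization")
```
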
A company wants to build a data lake on S3 with centralized access control, including column-level permissions. Which service should they use?
Correct: B
Lake Formation provides centralized access control for data lakes, including fine-grained permissions at the column and row level. S3 bucket policies operate at the object level. Glue Data Catalog is for metadata, not access control. Athena is for querying.
Domain: 3 — Design High-Performing Architectures Task: 3.5
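
A column-level grant in Lake Formation can be sketched as a request payload. The helper below only builds the arguments for the `grant_permissions` call on the boto3 `lakeformation` client and does not contact AWS; the helper name, role ARN, database, table, and column names are all hypothetical.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a Lake Formation grant that allows
    SELECT on the listed columns only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }

# Hypothetical principal and table; sensitive columns are simply left out.
request = build_column_grant(
    "arn:aws:iam::111122223333:role/AnalystRole",
    "sales_db",
    "orders",
    ["order_id", "order_date"],
)

# With credentials configured, this could be sent as:
#   boto3.client("lakeformation").grant_permissions(**request)
print(request["Resource"]["TableWithColumns"]["ColumnNames"])
```

An S3 bucket policy has no equivalent of the `ColumnNames` list, which is the gap Lake Formation fills in this scenario.
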
A company needs to process streaming data in real time using SQL queries. Which service should they use?
Correct: C
Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink) runs SQL or Apache Flink applications on streaming data in real time. Athena queries data at rest in S3. Redshift is a data warehouse for batch analytics. Glue is primarily for batch ETL.
Domain: 3 — Design High-Performing Architectures Task: 3.5
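
Streaming queries of this kind usually aggregate over time windows rather than over a complete table. The following is a plain-Python sketch of a tumbling-window count, the sort of result a windowed `GROUP BY` produces; it is not Kinesis Data Analytics or Flink code, and the event tuples are hypothetical.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key), where windows are fixed-size
    and non-overlapping. events is an iterable of (timestamp_seconds, key)."""
    counts = Counter()
    for ts, key in events:
        # Align each timestamp to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical page-view events: (seconds since start, page).
events = [(0, "a"), (5, "a"), (61, "a"), (62, "b")]
result = tumbling_window_counts(events, window_seconds=60)
print(result)
```
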