1.1.10 Streaming Data
Handle Streaming Data Using AWS Services
Kinesis Data Streams vs Kinesis Data Firehose
| Feature | Data Streams | Data Firehose |
|---|---|---|
| Latency | Real-time (~200ms) | Near real-time (60s buffer min) |
| Scaling | Manual (add/remove shards) | Auto-scaling |
| Consumers | Custom (Lambda, KCL, SDK) | S3, Redshift, OpenSearch, HTTP |
| Retention | 24h → 365 days | No storage (delivery only) |
| Replay | Yes | No |
| Provisioning | On-demand or Provisioned | Fully managed |
Kinesis Data Streams
Shard Capacity
| Metric | Per Shard |
|---|---|
| Write | 1 MB/s or 1,000 records/s |
| Read (shared) | 2 MB/s |
| Read (enhanced fan-out) | 2 MB/s per consumer |
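The per-shard write limits above determine how many shards a stream needs: take the larger of the throughput-based and record-rate-based counts. A minimal sketch (the function name and inputs are illustrative, not an AWS API):

```python
import math

def shards_needed(write_mb_per_s: float, records_per_s: float) -> int:
    """Estimate shard count from per-shard write limits:
    1 MB/s or 1,000 records/s, whichever is the bottleneck."""
    by_throughput = math.ceil(write_mb_per_s / 1.0)
    by_records = math.ceil(records_per_s / 1000.0)
    return max(by_throughput, by_records, 1)
```

For example, 4.5 MB/s at 2,000 records/s needs 5 shards (throughput-bound), while 0.5 MB/s at 3,500 records/s needs 4 (record-rate-bound).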
Consumer Types
```python
import boto3

kinesis = boto3.client('kinesis')

# Get an iterator for one shard ('my-stream' is a placeholder name),
# starting from the oldest available record
shard_iterator = kinesis.get_shard_iterator(
    StreamName='my-stream',
    ShardId='shardId-000000000000',
    ShardIteratorType='TRIM_HORIZON'
)['ShardIterator']

# Shared throughput consumer (pull model)
response = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)

# Enhanced fan-out (push model, 2 MB/s per consumer): uses the SubscribeToShard API
```
Lambda as Consumer
- Event source mapping: Lambda polls Kinesis
- Batch size: 1-10,000 records
- Batch window: 0-300 seconds
- Parallelization factor: 1-10 (concurrent batches per shard)
- On failure: bisect batch, retry, or send to a DLQ
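The settings above map directly onto the parameters of Lambda's `create_event_source_mapping` API. A sketch of the parameter dict (the ARNs and function name are hypothetical):

```python
# Hypothetical ARNs and function name; these parameters would be passed to
# lambda_client.create_event_source_mapping(**mapping_params)
mapping_params = {
    'EventSourceArn': 'arn:aws:kinesis:us-east-1:123456789012:stream/my-stream',
    'FunctionName': 'my-consumer-fn',
    'StartingPosition': 'LATEST',
    'BatchSize': 500,                     # 1 to 10,000 records
    'MaximumBatchingWindowInSeconds': 5,  # 0 to 300 seconds
    'ParallelizationFactor': 2,           # 1 to 10 concurrent batches per shard
    'BisectBatchOnFunctionError': True,   # split a failing batch and retry the halves
    'MaximumRetryAttempts': 3,
    'DestinationConfig': {                # failed records go to a DLQ
        'OnFailure': {'Destination': 'arn:aws:sqs:us-east-1:123456789012:my-dlq'}
    },
}
```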
Kinesis Data Firehose
Delivery Destinations
- Amazon S3
- Amazon Redshift (via S3)
- Amazon OpenSearch Service
- HTTP Endpoint (Datadog, Splunk, etc.)
Buffer Settings
| Setting | Min | Max |
|---|---|---|
| Buffer Size | 1 MB | 128 MB |
| Buffer Interval | 60s | 900s |
- Firehose delivers as soon as either condition is met, whichever comes first
- Data transformation: Lambda function (optional)
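A Firehose transformation Lambda must return each record with the same `recordId`, a `result` status, and base64-encoded `data`. A minimal sketch (the uppercase transform is just an example):

```python
import base64

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: echo each record back with
    the same recordId, a result status, and base64-encoded data."""
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data']).decode('utf-8')
        transformed = payload.upper() + '\n'   # example transform only
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',                    # or 'Dropped' / 'ProcessingFailed'
            'data': base64.b64encode(transformed.encode('utf-8')).decode('utf-8'),
        })
    return {'records': output}
```

Records marked `ProcessingFailed` are sent to the configured S3 error prefix rather than the destination.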
DynamoDB Streams
- Capture item-level changes (INSERT, MODIFY, REMOVE)
- Retention: 24 hours
- Lambda trigger for real-time processing
- Stream view types:
KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES
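A DynamoDB Streams consumer Lambda receives a batch of records, each tagged with an `eventName` of INSERT, MODIFY, or REMOVE; which images appear under `dynamodb` depends on the stream view type. A minimal handler sketch:

```python
def lambda_handler(event, context):
    """Summarize a DynamoDB Streams batch by change type.
    NewImage is present for NEW_IMAGE/NEW_AND_OLD_IMAGES views;
    OldImage is present for OLD_IMAGE/NEW_AND_OLD_IMAGES views."""
    counts = {'INSERT': 0, 'MODIFY': 0, 'REMOVE': 0}
    for record in event['Records']:
        counts[record['eventName']] += 1
    return counts
```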
Exam Tip: Real-time = Kinesis Data Streams. Near real-time delivery to S3/Redshift = Firehose. Need replay = Data Streams. DynamoDB Streams for change data capture (CDC).