1.1.10 Streaming Data

Handle Streaming Data Using AWS Services

Kinesis Data Streams vs Kinesis Data Firehose

| Feature | Data Streams | Data Firehose |
|---|---|---|
| Latency | Real-time (~200 ms) | Near real-time (60 s minimum buffer) |
| Scaling | Manual (add/remove shards) | Auto-scaling |
| Consumers | Custom (Lambda, KCL, SDK) | S3, Redshift, OpenSearch, HTTP |
| Retention | 24 h → 365 days | No storage (delivery only) |
| Replay | Yes (within retention) | No |
| Provisioning | On-demand or Provisioned | Fully managed |

Kinesis Data Streams

Shard Capacity

| Metric | Per Shard |
|---|---|
| Write | 1 MB/s or 1,000 records/s |
| Read (shared) | 2 MB/s (shared across all consumers) |
| Read (enhanced fan-out) | 2 MB/s per consumer |
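These per-shard limits drive stream sizing: the shard count must cover the largest of the write, record-rate, and shared-read requirements. A minimal sizing sketch (the function name and inputs are illustrative, not an AWS API):

```python
import math

def required_shards(write_mb_per_s: float,
                    records_per_s: float,
                    shared_read_mb_per_s: float) -> int:
    """Estimate shard count from the per-shard limits:
    1 MB/s or 1,000 records/s in, 2 MB/s shared read out."""
    by_write = math.ceil(write_mb_per_s / 1.0)
    by_records = math.ceil(records_per_s / 1000.0)
    by_read = math.ceil(shared_read_mb_per_s / 2.0)
    return max(by_write, by_records, by_read, 1)

# 5 MB/s in, 3,000 records/s, 8 MB/s aggregate shared read -> 5 shards
print(required_shards(5, 3000, 8))
```

Note that enhanced fan-out consumers do not count against the shared 2 MB/s, so they drop out of the read term.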

Consumer Types

```python
import boto3

kinesis = boto3.client('kinesis')

# Get an iterator for the shard before reading
# ('my-stream' is a placeholder stream name)
shard_iterator = kinesis.get_shard_iterator(
    StreamName='my-stream',
    ShardId='shardId-000000000000',
    ShardIteratorType='TRIM_HORIZON'
)['ShardIterator']

# Shared throughput consumer (pull model)
response = kinesis.get_records(
    ShardIterator=shard_iterator,
    Limit=100
)

# Enhanced fan-out (push model, 2 MB/s per consumer)
# uses the SubscribeToShard API instead
```

Lambda as Consumer

  • Event source mapping: Lambda polls Kinesis
  • Batch size: 1-10,000 records
  • Batch window: 0-300 seconds
  • Parallelization factor: 1-10 (concurrent batches per shard)
  • On failure: bisect batch, retry, or send to DLQ
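The failure options above pair with Lambda's partial-batch-response feature: the handler reports only the failed records so the event source mapping retries just those. A sketch (the `process` business logic and record contents are illustrative):

```python
import base64
import json

def handler(event, context):
    """Kinesis event-source handler: decode each record and report
    failures via batchItemFailures so only those are retried."""
    failures = []
    for record in event['Records']:
        try:
            payload = json.loads(
                base64.b64decode(record['kinesis']['data']))
            process(payload)  # hypothetical business logic
        except Exception:
            failures.append(
                {'itemIdentifier': record['kinesis']['sequenceNumber']})
    # Requires ReportBatchItemFailures on the event source mapping
    return {'batchItemFailures': failures}

def process(payload):
    if 'value' not in payload:
        raise ValueError('missing value')
```

Returning an empty `batchItemFailures` list marks the whole batch as succeeded.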

Kinesis Data Firehose

Delivery Destinations

  • Amazon S3
  • Amazon Redshift (via S3)
  • Amazon OpenSearch Service
  • HTTP Endpoint (Datadog, Splunk, etc.)

Buffer Settings

| Setting | Min | Max |
|---|---|---|
| Buffer Size | 1 MB | 128 MB |
| Buffer Interval | 60 s | 900 s |

  • Firehose delivers as soon as whichever of the two conditions is met first
  • Data transformation: Lambda function (optional)
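The optional transformation Lambda receives a batch of base64-encoded records and must return each one with a status of Ok, Dropped, or ProcessingFailed. A minimal sketch (the newline-appending transform is illustrative):

```python
import base64

def handler(event, context):
    """Firehose data-transformation Lambda: echo every record back
    with a result status and re-encoded data."""
    output = []
    for record in event['records']:
        data = base64.b64decode(record['data'])
        # Illustrative transform: ensure one record per line in S3
        transformed = data.rstrip(b'\n') + b'\n'
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(transformed).decode(),
        })
    return {'records': output}
```

Every incoming `recordId` must appear in the response, or Firehose treats the invocation as failed.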

DynamoDB Streams

  • Capture item-level changes (INSERT, MODIFY, REMOVE)
  • Retention: 24 hours
  • Lambda trigger for real-time processing
  • Stream view types: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES
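With NEW_AND_OLD_IMAGES, each stream record carries both the before and after item state, so a Lambda trigger can route on the change type. A sketch (the returned change list is illustrative):

```python
def handler(event, context):
    """Classify DynamoDB Streams records by change type.
    Assumes the stream view type is NEW_AND_OLD_IMAGES."""
    changes = []
    for record in event['Records']:
        ddb = record['dynamodb']
        changes.append({
            'type': record['eventName'],  # INSERT, MODIFY, or REMOVE
            'keys': ddb['Keys'],
            'old': ddb.get('OldImage'),   # absent on INSERT
            'new': ddb.get('NewImage'),   # absent on REMOVE
        })
    return changes
```

Images arrive in DynamoDB's attribute-value format (e.g. `{'S': 'text'}`, `{'N': '42'}`), not plain Python values.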

Exam Tip: Real-time = Kinesis Data Streams. Near real-time delivery to S3/Redshift = Firehose. Need replay = Data Streams. DynamoDB Streams is for change data capture (CDC).