Key knowledge:
Detailed explanation:
Latency Reduction Techniques:
| Technique | Latency reduction | Trade-off |
|---|---|---|
| Streaming | Perceived latency drops 80%+ | More complexity |
| Smaller models | 2-5x faster | Quality may drop |
| Pre-computation | Near-zero for cached queries | Storage cost |
| Parallel requests | Total time = max(individual) | Higher token cost |
| Prompt caching | Lower TTFT | Cache management |
TTFT (Time to First Token) is a key metric for perceived user experience.
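A minimal sketch of why streaming improves perceived latency: the model call is simulated with `asyncio.sleep`, and TTFT is measured against total generation time (the token count and per-token delay are made up for illustration):

```python
import asyncio
import time

async def generate_stream(n_tokens=20, per_token=0.05):
    """Simulated streaming model: yields one token every per_token seconds."""
    for i in range(n_tokens):
        await asyncio.sleep(per_token)
        yield f"tok{i}"

async def main():
    start = time.monotonic()
    ttft = None
    tokens = []
    async for tok in generate_stream():
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrives here
        tokens.append(tok)
    total = time.monotonic() - start
    # With streaming, the user sees output after roughly one token's
    # latency instead of waiting the full `total` for the whole response.
    print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s, tokens: {len(tokens)}")
    return ttft, total

ttft, total = asyncio.run(main())
```

The same response takes the same total time either way; streaming only changes when the first useful output reaches the user.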
Parallel Processing Pattern:
Complex Query
├── Sub-query 1 → Model A → Result 1
├── Sub-query 2 → Model B → Result 2 (parallel)
└── Sub-query 3 → Model C → Result 3
↓
Aggregate Results → Final Response
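The fan-out above can be sketched with `asyncio.gather`; the three model calls are simulated with sleeps, so wall-clock time tracks the slowest sub-query rather than the sum (model names, latencies, and the aggregation step are hypothetical):

```python
import asyncio
import time

async def call_model(name, latency, sub_query):
    # Stand-in for a real model invocation; latency is simulated.
    await asyncio.sleep(latency)
    return f"{name} answered '{sub_query}'"

async def answer(complex_query):
    sub_queries = [("Model A", 0.3), ("Model B", 0.5), ("Model C", 0.2)]
    # Fire all sub-queries concurrently; gather preserves input order.
    results = await asyncio.gather(
        *(call_model(name, lat, f"{complex_query} / part {i + 1}")
          for i, (name, lat) in enumerate(sub_queries))
    )
    return " | ".join(results)  # aggregate step

start = time.monotonic()
final = asyncio.run(answer("complex query"))
elapsed = time.monotonic() - start
print(f"{elapsed:.2f}s")  # ~0.5s (the max), not 1.0s (the sum)
```

The token-cost trade-off from the table applies: all three sub-queries are billed even though they overlap in time.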
Key knowledge:
Detailed explanation:
Vector Search Optimization:
| Optimization | Description |
|---|---|
| Index type | HNSW for speed, IVF for memory efficiency |
| Sharding | Distribute data across nodes |
| Dimension reduction | Reduce embedding dimensions |
| Hybrid search | Combine keyword + vector search |
| Re-ranking | Bedrock reranker models for better relevance |
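Dimension reduction can be as simple as truncating and re-normalizing, which is valid for Matryoshka-style embeddings (for arbitrary embeddings, PCA or a learned projection is the usual tool). A sketch with a made-up 6-dimensional vector:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length.
    Only valid when the embedding model was trained Matryoshka-style;
    otherwise use PCA or a learned projection instead."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

emb = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1]  # hypothetical 6-d embedding
small = truncate_embedding(emb, 4)
print(len(small))  # 4
```

Halving dimensions roughly halves index memory and distance-computation cost, at some recall cost.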
Hybrid Search:
Query → Keyword Search (BM25) → Top 20 results
↓
Query → Vector Search (semantic) → Top 20 results
↓
Merge + Re-rank → Top 5 final results
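The merge step in the diagram is unspecified; one common choice is Reciprocal Rank Fusion (RRF), sketched here with hypothetical document IDs standing in for the two top-20 result lists:

```python
def rrf_merge(keyword_results, vector_results, k=60, top_n=5):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1/(k + rank).
    Documents appearing high in both lists float to the top."""
    scores = {}
    for results in (keyword_results, vector_results):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

bm25_hits   = ["d3", "d1", "d7", "d9"]  # hypothetical keyword top hits
vector_hits = ["d1", "d4", "d3", "d8"]  # hypothetical semantic top hits
merged = rrf_merge(bm25_hits, vector_hits)
print(merged)
```

In the diagram's pipeline, a Bedrock reranker model would then re-score this merged list for the final top 5; RRF alone already rewards agreement between the two retrievers.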
Key knowledge:
Temperature controls randomness: 0 = deterministic, 1 = creative. Top-k limits the number of candidate tokens; top-p (nucleus sampling) limits cumulative probability mass. For tasks that need accuracy (extraction, classification), use a low temperature. For creative tasks, use a higher temperature.
Parameter Guidelines:
| Use Case | Temperature | Top-p | Max Tokens |
|---|---|---|---|
| Data extraction | 0 | 0.1 | Just enough |
| Classification | 0 | 0.1 | Short |
| Summarization | 0.3 | 0.5 | Medium |
| Creative writing | 0.7-1.0 | 0.9 | Long |
| Code generation | 0.2 | 0.3 | Long |
| Chatbot | 0.5 | 0.7 | Medium |
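How temperature, top-k, and top-p interact can be shown with a toy sampler over a hypothetical three-token vocabulary (real inference applies these steps to the model's logits at every decoding step):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Toy decoder step: temperature scaling, then top-k cut, then
    nucleus (top-p) cut. temperature=0 is treated as greedy argmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # 1) temperature: <1 sharpens the distribution, >1 flattens it
    scaled = {t: l / temperature for t, l in logits.items()}
    # 2) top-k: keep only the k highest-scoring tokens
    kept = sorted(scaled, key=scaled.get, reverse=True)[:top_k]
    # softmax over the kept tokens
    m = max(scaled[t] for t in kept)
    exp = {t: math.exp(scaled[t] - m) for t in kept}
    z = sum(exp.values())
    probs = {t: exp[t] / z for t in kept}
    # 3) top-p: smallest prefix of tokens whose cumulative prob >= top_p
    nucleus, cum = [], 0.0
    for t in sorted(probs, key=probs.get, reverse=True):
        nucleus.append(t)
        cum += probs[t]
        if cum >= top_p:
            break
    return rng.choices(nucleus, weights=[probs[t] for t in nucleus])[0]

logits = {"yes": 4.0, "no": 1.0, "maybe": 0.5}  # made-up logits
print(sample_token(logits, temperature=0))  # greedy -> "yes"
```

This matches the table's intuition: temperature 0 always returns the argmax (extraction, classification), while a small top-p shrinks the nucleus toward the single most likely token even when temperature is nonzero.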
Key knowledge:
Detailed explanation:
End-to-End Optimization Checklist: