
4.5 Batch Processing and Prompt Optimisation

What Is Batch Processing?

The Message Batches API lets you submit a large number of requests at once at a 50% discount, in exchange for higher latency (processing completes within 24 hours). It suits tasks that do not need a real-time response, such as batch classification, bulk content generation, and data processing.

Message Batches API

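import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder inputs; in practice these come from your dataset
review_1 = "Great product, arrived on time."
review_2 = "Stopped working after two days."

# Create the batch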
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "review-001",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 200,
                "messages": [{"role": "user", "content": f"Classify: {review_1}"}]
            }
        },
        {
            "custom_id": "review-002",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 200,
                "messages": [{"role": "user", "content": f"Classify: {review_2}"}]
            }
        },
        # ... hundreds more
    ]
)

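# Poll until processing has ended
while batch.processing_status != "ended":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)

# Read results; each entry carries the custom_id of its original request
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")
    else:
        # errored, canceled, or expired
        print(f"{result.custom_id}: {result.result.type}")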
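
Writing hundreds of request dicts by hand does not scale, and results are not guaranteed to come back in request order, so custom_id is the join key. The sketch below shows one way to generate the requests from a list and to index results by custom_id. The helper names build_requests, index_results, and fake_entry are hypothetical, and the SimpleNamespace objects only imitate the shape of real result entries so the example runs without an API call.

```python
from types import SimpleNamespace

MODEL = "claude-sonnet-4-20250514"  # same model ID as the example above


def build_requests(reviews, prefix="review"):
    """Turn a list of review strings into Message Batches request dicts."""
    return [
        {
            "custom_id": f"{prefix}-{i:03d}",
            "params": {
                "model": MODEL,
                "max_tokens": 200,
                "messages": [{"role": "user", "content": f"Classify: {text}"}],
            },
        }
        for i, text in enumerate(reviews, start=1)
    ]


def index_results(results):
    """Map custom_id -> response text, or None when the request did not succeed."""
    indexed = {}
    for entry in results:
        if entry.result.type == "succeeded":
            indexed[entry.custom_id] = entry.result.message.content[0].text
        else:
            indexed[entry.custom_id] = None
    return indexed


def fake_entry(custom_id, text=None):
    """Hypothetical stand-in for one batch result entry (not an SDK type)."""
    if text is None:
        return SimpleNamespace(custom_id=custom_id, result=SimpleNamespace(type="errored"))
    message = SimpleNamespace(content=[SimpleNamespace(text=text)])
    return SimpleNamespace(
        custom_id=custom_id,
        result=SimpleNamespace(type="succeeded", message=message),
    )


requests = build_requests(["Great product", "Arrived broken"])
# Entries deliberately listed out of order to show that the lookup does not depend on it
labels = index_results([fake_entry("review-002", "negative"), fake_entry("review-001", "positive")])
```

With a real client, the list from build_requests would be passed as requests= to client.messages.batches.create, and the iterator from client.messages.batches.results would be passed to index_results.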

Prompt Caching

Prompt Caching reduces cost when requests share a repeated prefix, for example the same system prompt used with different user messages:

# First call: writes the system prompt to the cache
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,  # required parameter
    system=[{
        "type": "text",
        "text": long_system_prompt,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Question 1"}]
)

# Subsequent calls: cached prefix tokens are billed at about 10% of the base input price
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,  # required parameter
    system=[{
        "type": "text",
        "text": long_system_prompt,  # same text = cache hit
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Question 2"}]
)
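
To confirm that a call actually hit the cache, the response's usage object can be inspected. The sketch below assumes the usage fields cache_read_input_tokens and cache_creation_input_tokens are present on the response; the helper name cache_status is hypothetical, and SimpleNamespace stands in for a real usage object so the example runs offline.

```python
from types import SimpleNamespace


def cache_status(usage):
    """Classify a usage object as a cache 'hit', a cache 'write', or 'none'."""
    # Treat missing or None fields as zero so older responses do not raise
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"      # prefix was served from the cache
    if written > 0:
        return "write"    # prefix was stored for later calls
    return "none"         # nothing cached, e.g. prefix below the minimum length


# Stand-in usage objects that imitate a first call and a follow-up call
first_call = SimpleNamespace(cache_creation_input_tokens=2048, cache_read_input_tokens=0)
second_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=2048)
statuses = [cache_status(first_call), cache_status(second_call)]
```

With a real response, the call would be cache_status(response.usage). A status of "none" on every call is a hint that the prefix is too short or differs between requests.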

Cost Optimization Strategies

Strategy	Savings	Trade-off
Batches API	50%	Results within 24 hours, not real time
Prompt Caching	Up to 90% on cached tokens	Prefix must be at least 1024 tokens
Smaller model (Haiku)	~10x cheaper than Opus	Lower capability
Shorter prompts	Linear reduction	May reduce quality
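
The savings in the table can be compared with a little arithmetic. The sketch below is a rough estimator for input-token cost only, under several assumptions that are not from the table: the price of 3.00 per million tokens is a made-up figure, the cache write multiplier of 1.25 is an assumed surcharge on the first call, and every call after the first is assumed to hit the cache. The function name estimate_input_cost is hypothetical.

```python
def estimate_input_cost(calls, prefix_tokens, unique_tokens, price_per_mtok, *,
                        batch=False, cache=False,
                        batch_mult=0.50,        # 50% discount from the table
                        cache_read_mult=0.10,   # up to 90% off cached tokens
                        cache_write_mult=1.25): # assumed surcharge for the first write
    """Estimate input-token cost for `calls` requests sharing one prefix."""
    if calls <= 0:
        return 0.0
    if cache:
        # First call writes the prefix; the remaining calls read it back
        prefix_units = prefix_tokens * cache_write_mult
        prefix_units += prefix_tokens * cache_read_mult * (calls - 1)
    else:
        prefix_units = prefix_tokens * calls
    total_units = prefix_units + unique_tokens * calls
    cost = total_units / 1_000_000 * price_per_mtok
    return cost * batch_mult if batch else cost


# 1,000 calls, 2,000-token shared system prompt, 100 unique tokens per call
baseline = estimate_input_cost(1000, 2000, 100, 3.0)
batched = estimate_input_cost(1000, 2000, 100, 3.0, batch=True)
cached = estimate_input_cost(1000, 2000, 100, 3.0, cache=True)
```

Under these assumptions the three figures come to about 6.30, 3.15, and 0.91, which shows why caching matters most when the shared prefix dominates each request.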

Throughput Optimization

Parallel requests: use asyncio or threading to send several requests at the same time
Rate limit awareness: stay within the API rate limits (requests per minute, RPM, and tokens per minute, TPM)
Batch sizing: the Batches API accepts up to 100,000 requests or 256 MB per batch, whichever is reached first
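
The first and third points can be sketched as follows: a semaphore caps the number of in-flight requests so parallelism does not overrun rate limits, and a chunking helper splits a long request list into batch-sized pieces. The names run_bounded, chunked, and fake_send are hypothetical, and fake_send is only a stand-in for a real asynchronous API call.

```python
import asyncio


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


async def fake_send(prompt):
    """Hypothetical stand-in for a real async API call."""
    await asyncio.sleep(0.01)
    return f"label for {prompt}"


async def run_bounded(prompts, send, max_concurrency=5):
    """Run send(prompt) for every prompt with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:  # waits here while the limit is reached
            return await send(prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(run_bounded(["a", "b", "c"], fake_send, max_concurrency=2))
pieces = chunked(list(range(5)), 2)
```

A semaphore limits concurrency, not requests per minute, so a real client would still need retry or backoff handling when the API reports a rate limit error.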

Exam Tips

Batches API = 50% discount, 24hr processing window
Prompt Caching = lower cost for repeated prefixes (system prompts, few-shot examples)
cache_control: {"type": "ephemeral"} = mark content for caching (5 min TTL)
The exam may ask when to use Batches versus real-time calls: Batches suit bulk processing that is not urgent
Caching needs at least 1024 tokens; a system prompt shorter than that gets no benefit from caching
custom_id in a batch maps each result back to its original request
