5.3 Error Propagation in Multi-Agent Systems
What Is Error Propagation?
In a multi-agent system, an error in one agent can “propagate” to every agent that depends on its output, like falling dominoes: one wrong result makes the whole pipeline wrong. Error propagation is more serious than a single-agent error because:
- The root cause is hard to trace (it is buried deep in the pipeline)
- The error gets “transformed” along the way (Agent B interprets Agent A's error as something else)
- Recovery is complex (several agents may need to be rolled back)
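One way to limit the second problem, errors being transformed along the way, is to keep the original failure attached as the error moves between agents. The following is a minimal sketch, not a prescribed implementation: `AgentError`, `root_cause`, and the two stand-in agents are hypothetical names invented for illustration. It relies only on Python's standard exception chaining (`raise ... from ...`).

```python
class AgentError(Exception):
    """Hypothetical wrapper that records which agent raised the error."""

    def __init__(self, agent: str, message: str):
        super().__init__(f"[{agent}] {message}")
        self.agent = agent


def research_agent(query: str) -> str:
    # Stand-in for a real agent; it always fails for this demo.
    raise AgentError("research", f"no sources found for {query!r}")


def design_agent(query: str) -> str:
    try:
        return "design based on " + research_agent(query)
    except AgentError as exc:
        # "from exc" keeps the upstream error reachable via __cause__.
        raise AgentError("design", "cannot design without research") from exc


def root_cause(exc: BaseException) -> BaseException:
    """Follow the __cause__ chain back to the first error."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


try:
    design_agent("PostgreSQL feature X")
except AgentError as exc:
    origin = root_cause(exc)
    print(f"failed in: {exc.agent}, root cause in: {origin.agent}")
```

The error surfaces at the design stage, but the chain still points back to the research stage, so tracing does not depend on how Agent B worded its own message.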
Types of Error Propagation
1. Data Corruption Cascade
Agent A returns wrong data → Agent B builds its output on that data → Agent C consumes B's wrong output:
# Agent A: Research — returns wrong version number
research = "PostgreSQL 16 supports feature X" # Wrong! It's PG 17+
# Agent B: Design — designs based on wrong info
design = f"We'll use {research}, so our design..." # Propagated error
# Agent C: Implementation — implements wrong design
code = f"Based on design: {design}, implementing..." # Error cascade complete
2. Silent Failure
Agent fails quietly — returns partial/empty results without explicit error:
# Dangerous: agent returns empty but doesn't flag it
def search_agent(query):
    results = search(query)
    # [] is ambiguous: the caller can't tell "nothing found" from "search failed"
    return results

# Safe: explicit error signaling
def search_agent(query):
    try:
        results = search(query)
        return {"status": "success", "results": results, "count": len(results)}
    except SearchError as e:
        return {"status": "error", "error": str(e), "results": []}
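On the calling side, an explicit status lets the next agent tell "nothing found" apart from "search failed". A minimal sketch, assuming the response shape used by the safe version above; `handle_search_response` is a hypothetical helper name, and raising `RuntimeError` is just one possible way to fail fast.

```python
def handle_search_response(response: dict) -> list:
    """Return results, or raise so a failure cannot pass as an empty result."""
    if response.get("status") != "success":
        # Fail fast: a failed search is not the same as zero hits.
        raise RuntimeError(f"search failed: {response.get('error', 'unknown error')}")
    return response["results"]


# Zero hits is a valid answer and flows downstream normally.
print(handle_search_response({"status": "success", "results": [], "count": 0}))

# A failure stops the pipeline instead of looking like "nothing found".
try:
    handle_search_response({"status": "error", "error": "timeout", "results": []})
except RuntimeError as err:
    print(err)
```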
3. Hallucination Amplification
Agent A hallucinates → Agent B treats hallucination as fact → amplifies:
# Agent A hallucinates a function name
"Use the built-in `validate_schema()` function" # Doesn't exist
# Agent B builds on it
"Since we have validate_schema(), we can chain it with..." # Amplified
# Agent C implements it
"import validate_schema..." # Will crash at runtime
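A cheap guard against this chain is to check that a referenced symbol really exists before another agent builds on it. The sketch below is one possible check, not a complete defense: `symbol_exists` is a hypothetical helper, and the `json` module is used only as a convenient standard-library example.

```python
import importlib


def symbol_exists(module_name: str, attr: str) -> bool:
    """Check that `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


print(symbol_exists("json", "loads"))            # real function
print(symbol_exists("json", "validate_schema"))  # hallucinated name
```

Importing a module runs its top-level code, so a check like this is only appropriate for modules that are already installed and trusted.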
Protection Patterns
Circuit Breaker
import time

class CircuitOpenError(Exception):
    """Raised when a call is blocked because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = timeout  # seconds to wait before allowing a trial call
        self.state = "closed"  # closed = normal, open = blocking
        self.last_failure_time = None

    def call(self, agent_fn, *args):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # Try one request
            else:
                raise CircuitOpenError("Agent circuit is open: too many failures")
        try:
            result = agent_fn(*args)
            if self.state == "half-open":
                # Trial call succeeded: resume normal operation
                self.state = "closed"
                self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.threshold:
                # A failed half-open trial also lands here, because the count was not reset
                self.state = "open"
            raise
Validation Gates Between Agents
def pipeline_with_gates(input_data):
    # Agent A: Research
    research = research_agent(input_data)
    # Gate 1: Validate research output
    if not validate_research(research):
        return {"error": "Research validation failed", "stage": "research"}
    # Agent B: Design
    design = design_agent(research)
    # Gate 2: Validate design against requirements
    if not validate_design(design, input_data.requirements):
        return {"error": "Design validation failed", "stage": "design"}
    # Agent C: Implementation
    implementation = implement_agent(design)
    return {"success": True, "result": implementation}
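What a gate checks depends on the pipeline, and the notes do not define `validate_research`. As one possible shape, and purely as an assumption, a gate could require that the research output contains at least one claim and that every claim names a source. The `claims`, `text`, and `source` fields below are hypothetical.

```python
def validate_research(research: dict) -> bool:
    """Hypothetical gate: every claim must be non-empty and carry a source."""
    claims = research.get("claims")
    if not isinstance(claims, list) or not claims:
        return False
    return all(
        isinstance(c, dict) and c.get("text") and c.get("source")
        for c in claims
    )


good = {"claims": [{"text": "Feature X exists", "source": "release notes"}]}
bad = {"claims": [{"text": "Feature X exists", "source": ""}]}

print(validate_research(good))  # True
print(validate_research(bad))   # False
```

A structural check like this cannot tell whether a claim is true, only that it is traceable. That still helps: an unsourced claim is stopped at the gate instead of being handed to the design agent.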
Error Boundaries (Isolation)
import asyncio
import logging

async def isolated_agent_call(agent_fn, input_data, fallback=None):
    """Run an agent in isolation so its errors don't propagate."""
    try:
        result = await agent_fn(input_data)
        if result is None:
            return fallback
        return result
    except Exception as e:
        logging.error("Agent failed: %s", e)
        return fallback

# Use in pipeline (agent_a, agent_b, agent_c are async agent functions)
async def run_agents(data):
    # asyncio.gather runs the three isolated calls concurrently
    results = await asyncio.gather(
        isolated_agent_call(agent_a, data, fallback={"status": "skipped"}),
        isolated_agent_call(agent_b, data, fallback={"status": "skipped"}),
        isolated_agent_call(agent_c, data, fallback={"status": "skipped"}),
    )
    # Filter out failed agents
    return [r for r in results if r.get("status") != "skipped"]
Retry with Exponential Backoff
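import asyncio
import random

class RateLimitError(Exception):
    """Placeholder; use the rate-limit exception from your client library."""

async def retry_agent(agent_fn, input_data, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await agent_fn(input_data)
        except (RateLimitError, TimeoutError):
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # 1s, 2s, 4s, ... plus up to 1s of random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)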
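The wait formula doubles with each attempt and adds up to one second of random jitter, so that agents retrying at the same moment spread out. With `max_retries=3` the function sleeps only after attempts 0 and 1 (roughly 1 to 2 seconds, then 2 to 3 seconds), because the last attempt re-raises. The sketch below isolates that arithmetic; the `base` and `cap` parameters are additions for illustration and are not part of the code above.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before the next retry: base * 2**attempt plus up to 1s of jitter, capped."""
    return min(cap, base * (2 ** attempt) + random.uniform(0, 1))


for attempt in range(3):
    delay = backoff_delay(attempt)
    print(f"attempt {attempt}: sleeping {delay:.2f}s")
```

A cap is a common safeguard when the retry count is larger, since an uncapped exponential reaches minutes within a few attempts.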
Key Concepts
- Fail-fast — detect errors early, before they propagate downstream
- Error boundaries — isolate agents so one failure doesn’t take down the whole system
- Circuit breaker — stop calling a failing agent after N failures; try again after timeout
- Validation gates — verify output between pipeline stages before passing to next agent
- Idempotent retry — safe to retry failed operations without side effects
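Idempotent retry is the least self-explanatory of these concepts. One common way to get it is an idempotency key: the caller generates a key once per logical operation and reuses it on every retry, and the receiving side returns the stored result instead of repeating the side effect. The sketch below assumes a hypothetical `TicketTool` that an agent might call; none of these names come from the notes above.

```python
import uuid


class TicketTool:
    """Hypothetical side-effecting tool that deduplicates by idempotency key."""

    def __init__(self):
        self.results = {}   # idempotency key -> stored result
        self.created = 0    # number of real side effects performed

    def create_ticket(self, key: str, title: str) -> dict:
        if key in self.results:
            # Key already processed: return the stored result, create nothing.
            return self.results[key]
        self.created += 1
        result = {"status": "success", "ticket_id": self.created, "title": title}
        self.results[key] = result
        return result


tool = TicketTool()
key = str(uuid.uuid4())  # generated once per logical operation, reused on retry

first = tool.create_ticket(key, "Fix login bug")
retry = tool.create_ticket(key, "Fix login bug")  # e.g. the first response was lost

print(first == retry, tool.created)
```

Without the key, a retry after a lost response would create a second ticket, which is the kind of side effect that makes blind retries unsafe.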
Exam Tips
- Silent failures are the most dangerous: an agent must explicitly signal errors, not return an empty result
- Circuit breaker pattern: closed → open (after N failures) → half-open (try one) → closed
- The exam may ask what happens when Agent A returns wrong data. Answer: a cascade failure, unless a validation gate is in place
- Validation gates between agents = check output before passing it on = prevent propagation
- Hallucination amplification = Claude generates incorrect information, then the next agent trusts it and builds on it