5.3 Error Propagation in Multi-Agent Systems

What Is Error Propagation?

In a multi-agent system, when one agent fails, the error can propagate to the other agents that depend on its output, like falling dominoes: one wrong result makes the whole pipeline wrong. Error propagation is a more serious problem than a single-agent error because:

  1. The root cause is hard to trace (it is buried deep in the pipeline)
  2. The error is transformed along the way (agent B interprets agent A's error as something else)
  3. Recovery is complex (multiple agents must be rolled back)
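One way to keep the root cause traceable is to carry a lineage record with every output and to pass an upstream error through unchanged instead of reinterpreting it. The following is a minimal sketch under that assumption; `AgentOutput`, `run_stage`, and `first_failure` are hypothetical names, not part of any specific framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AgentOutput:
    """Payload plus a record of every stage that saw it (hypothetical type)."""
    value: str
    lineage: List[str] = field(default_factory=list)
    error: Optional[str] = None


def run_stage(name: str, fn: Callable[[str], str], prev: AgentOutput) -> AgentOutput:
    """Run one agent stage and append its name to the lineage."""
    lineage = prev.lineage + [name]
    if prev.error is not None:
        # Forward the original error untouched so the root cause stays visible.
        return AgentOutput(prev.value, lineage, prev.error)
    try:
        return AgentOutput(fn(prev.value), lineage)
    except Exception as exc:
        # Prefix the error with the stage name where it originated.
        return AgentOutput(prev.value, lineage, f"{name}: {exc}")


def first_failure(out: AgentOutput) -> Optional[str]:
    """Return the name of the stage where the error originated, if any."""
    return out.error.split(":", 1)[0] if out.error else None
```

With this shape, a downstream stage never has to guess what went wrong upstream: it sees the first error and which stage produced it.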

Types of Error Propagation

1. Data Corruption Cascade

Agent A returns wrong data → Agent B uses that wrong data to produce its output → Agent C uses B's wrong output:

# Agent A: Research — returns wrong version number
research = "PostgreSQL 16 supports feature X"  # Wrong! It's PG 17+

# Agent B: Design — designs based on wrong info
design = f"We'll use {research}, so our design..."  # Propagated error

# Agent C: Implementation — implements wrong design
code = f"Based on design: {design}, implementing..."  # Error cascade complete

2. Silent Failure

An agent fails quietly, returning partial or empty results without an explicit error:

# Dangerous: agent returns empty but doesn't flag it
def search_agent(query):
    results = search(query)
    return results  # [] — empty, but caller doesn't know if that means "nothing found" or "search failed"

# Safe: explicit error signaling
# (search() and SearchError stand in for whatever search backend the agent wraps)
def search_agent(query):
    try:
        results = search(query)
        return {"status": "success", "results": results, "count": len(results)}
    except SearchError as e:
        return {"status": "error", "error": str(e), "results": []}
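The explicit status field only helps if the caller actually branches on it. The following is a minimal sketch of such a caller, assuming the response shape of the safe version above; `handle_search_result` is a hypothetical helper.

```python
def handle_search_result(response: dict) -> str:
    """Decide the next step from the explicit status field."""
    status = response.get("status")
    if status == "error":
        # The search itself failed: stop or retry, never treat it as "no results".
        return f"abort: {response.get('error', 'unknown error')}"
    if status != "success":
        # A missing or unknown status counts as a failure, not as success.
        return "abort: malformed response"
    if response.get("count", 0) == 0:
        return "continue: nothing found"
    return f"continue: {response['count']} result(s)"
```

Treating a missing status as a failure is a deliberate choice here: it keeps a malformed response from slipping through as an empty success.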

3. Hallucination Amplification

Agent A hallucinates → Agent B treats the hallucination as fact → the error is amplified:

# Agent A hallucinates a function name
"Use the built-in `validate_schema()` function"  # Doesn't exist

# Agent B builds on it
"Since we have validate_schema(), we can chain it with..."  # Amplified

# Agent C implements it
"import validate_schema..."  # Will crash at runtime

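One hedge against this chain is to check that a referenced name really exists before a downstream agent builds on it. The following is a minimal sketch using the standard library's `importlib`; `symbol_exists` is a hypothetical helper, and the `json` module is used only as an illustration.

```python
import importlib


def symbol_exists(module_name: str, attr: str) -> bool:
    """Return True only if the module imports and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


# json.loads is real; json has no validate_schema function.
print(symbol_exists("json", "loads"))            # True
print(symbol_exists("json", "validate_schema"))  # False
```

A check like this catches invented identifiers, but not a real function that is described incorrectly, so it complements a validation gate rather than replacing it.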
Protection Patterns

Circuit Breaker

import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = timeout  # seconds to wait before allowing a trial call
        self.state = "closed"  # closed = normal, open = blocking, half-open = trial
        self.last_failure_time = None

    def call(self, agent_fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # Let one trial request through
            else:
                raise CircuitOpenError("Agent circuit is open: too many failures")

        try:
            result = agent_fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()
            if self.failures >= self.threshold:
                self.state = "open"
            raise

        # Any success closes the circuit and resets the count, so only
        # consecutive failures can trip the breaker.
        self.state = "closed"
        self.failures = 0
        return result
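The breaker's behavior can also be read as a small state machine. The following is a minimal sketch of the transitions as a lookup table; the event names (`threshold_reached`, `timeout_elapsed`, `success`, `failure`) are hypothetical labels chosen for illustration.

```python
from typing import Dict, Tuple

# (current state, event) -> next state
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("closed", "threshold_reached"): "open",
    ("open", "timeout_elapsed"): "half-open",
    ("half-open", "success"): "closed",
    ("half-open", "failure"): "open",
}


def next_state(state: str, event: str) -> str:
    """Return the next breaker state; unknown pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note the fourth row: a failed trial call in the half-open state sends the breaker straight back to open.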

Validation Gates Between Agents

def pipeline_with_gates(input_data):
    # Agent A: Research
    research = research_agent(input_data)

    # Gate 1: Validate research output
    if not validate_research(research):
        return {"success": False, "error": "Research validation failed", "stage": "research"}

    # Agent B: Design
    design = design_agent(research)

    # Gate 2: Validate design against requirements
    if not validate_design(design, input_data.requirements):
        return {"success": False, "error": "Design validation failed", "stage": "design"}

    # Agent C: Implementation
    implementation = implement_agent(design)

    # Every return path carries "success", so callers can branch on one key
    return {"success": True, "result": implementation}
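`validate_research` is left undefined above. The following is a minimal sketch of what such a gate might check, assuming the research agent returns a dict with hypothetical `claims` and `sources` keys, where each claim names the source it relies on.

```python
def validate_research(research) -> bool:
    """Reject research output that is empty or contains unsourced claims."""
    if not isinstance(research, dict):
        return False
    claims = research.get("claims") or []
    sources = set(research.get("sources") or [])
    if not claims:
        return False  # Nothing for the next agent to build on
    for claim in claims:
        # Each claim must be a dict that cites a source listed in this output.
        if not isinstance(claim, dict) or claim.get("source") not in sources:
            return False
    return True
```

A structural gate like this catches missing or unsourced claims. It does not prove a claim is true, so a factual error such as a wrong version number still needs cross-checking against the cited source.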

Error Boundaries (Isolation)

import asyncio
import logging

logger = logging.getLogger(__name__)


async def isolated_agent_call(agent_fn, input_data, fallback=None):
    """Run an agent in isolation so its errors don't propagate."""
    try:
        result = await agent_fn(input_data)
        if result is None:
            return fallback
        return result
    except Exception as e:
        logger.error("Agent failed: %s", e)
        return fallback


async def run_isolated(data):
    # Use in pipeline. asyncio.gather stands in for the undefined parallel()
    # helper and runs the three isolated calls concurrently.
    skipped = {"status": "skipped"}
    results = await asyncio.gather(
        isolated_agent_call(agent_a, data, fallback=skipped),
        isolated_agent_call(agent_b, data, fallback=skipped),
        isolated_agent_call(agent_c, data, fallback=skipped),
    )

    # Filter out failed agents
    return [r for r in results if r.get("status") != "skipped"]

Retry with Exponential Backoff

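import asyncio
import random


class RateLimitError(Exception):
    """Placeholder for the client library's rate-limit exception."""


async def retry_agent(agent_fn, input_data, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await agent_fn(input_data)
        # asyncio.TimeoutError is a separate class before Python 3.11
        except (RateLimitError, TimeoutError, asyncio.TimeoutError):
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)  # backoff + jitter
            await asyncio.sleep(wait)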

Key Concepts

  • Fail-fast: detect errors early, before they propagate downstream
  • Error boundaries: isolate agents so one failure doesn't take down the whole system
  • Circuit breaker: stop calling a failing agent after N failures; try again after a timeout
  • Validation gates: verify output between pipeline stages before passing it to the next agent
  • Idempotent retry: a failed operation can be retried safely because repeating it does not duplicate side effects
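Retries are only safe when repeating the operation does not repeat its side effects. The following is a minimal sketch of one common approach, an idempotency key, using a hypothetical in-memory `IdempotentExecutor`; a real system would need a shared, persistent store instead of a dict.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Cache results by key so a retried call does not repeat side effects."""

    def __init__(self) -> None:
        self._done: Dict[str, Any] = {}

    def run(self, key: str, operation: Callable[[], Any]) -> Any:
        if key in self._done:
            return self._done[key]  # Already completed: reuse the stored result
        result = operation()        # May raise; nothing is cached on failure
        self._done[key] = result
        return result
```

Because a failed operation is never cached, a retry with the same key runs it again, while a retry after success returns the stored result without re-running it.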

Exam Tips

  • Silent failures are the most dangerous: an agent must explicitly signal errors, not return an empty result
  • Circuit breaker pattern: closed → open (after N failures) → half-open (try one) → closed
  • The exam may ask what happens when Agent A returns wrong data. Answer: a cascade failure, unless a validation gate is in place
  • Validation gates between agents = check output before passing it on = prevent propagation
  • Hallucination amplification = Claude generates incorrect information, and the next agent trusts it and builds on it