Handling Unreliability

Humans are not servers. Plan accordingly.

Unlike cloud infrastructure, humans cannot guarantee 99.99% uptime. They get sick, have emergencies, lose motivation, and occasionally just forget. This guide covers strategies for building resilient systems that work with human unreliability rather than against it.

Understanding Human Failure Modes

Humans fail differently than machines. Understanding these patterns helps you design better systems.

| Failure Mode | Frequency | Warning Signs | Recovery |
| --- | --- | --- | --- |
| Slow Response | Common | Increasing latency, short replies | Usually self-correcting |
| Quality Degradation | Common | More errors, less attention to detail | Rest or task reassignment |
| Missed Deadline | Occasional | No progress updates, silence | Reassignment + communication |
| Task Abandonment | Rare | Sudden offline, no response | Immediate reassignment |
| Complete Unavailability | Rare | Extended offline, vacation | Fallback to other humans |
📊 Baseline reliability: Across our platform, humans complete assigned tasks 94% of the time. The remaining 6% includes cancellations, timeouts, and life events. Plan for this margin.

Retry Strategies

When a task fails or times out, how you retry matters.

Simple Retry

For transient failures (human was briefly unavailable), a simple retry to the same human often works.

Simple Retry Logic
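// Small helper: resolve after `ms` milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function dispatchWithRetry(task, humanId, maxRetries = 2) {
  // One initial attempt plus up to `maxRetries` retries
  const maxAttempts = maxRetries + 1;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await haas.tasks.create({
        ...task,
        human_id: humanId,
        metadata: { ...task.metadata, attempt }
      });
    } catch (error) {
      // Only retry transient unavailability; rethrow everything else
      if (error.code === 'HUMAN_UNAVAILABLE' && attempt < maxAttempts) {
        // Wait before retry - humans need time
        await sleep(5 * 60 * 1000); // 5 minutes
        continue;
      }
      throw error;
    }
  }
}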

Fallback to Different Human

When a specific human is unavailable, route to an alternative with similar skills.

Fallback Strategy
async function dispatchWithFallback(task, preferredHumans) {
  for (const humanId of preferredHumans) {
    try {
      // A failed status lookup should not abort the whole fallback chain
      const status = await haas.humans.getStatus(humanId);
      if (status.availability.status !== 'available') continue;

      return await haas.tasks.create({
        ...task,
        human_id: humanId
      });
    } catch (error) {
      console.warn(`${humanId} failed (${error.message}), trying next...`);
    }
  }

  // All preferred humans unavailable - use auto-matching
  return await haas.tasks.create({
    ...task,
    human_id: null, // Let HaaS find someone
    requirements: task.skill_requirements
  });
}
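One way to build the preferredHumans list is to rank candidates by how closely their skills match the task. The following is a minimal sketch, not part of the HaaS API: the { id, skills } record shape, the rankBySkillOverlap helper, and the use of Jaccard similarity are all assumptions made for illustration.

```javascript
// Hypothetical helper: rank candidate humans by skill overlap with a task.
// The { id, skills } shape is an assumption, not a HaaS data model.
function jaccard(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  const intersection = [...setA].filter((skill) => setB.has(skill)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union;
}

function rankBySkillOverlap(requiredSkills, candidates) {
  return candidates
    .map((c) => ({ id: c.id, score: jaccard(requiredSkills, c.skills) }))
    .filter((c) => c.score > 0) // drop humans with no matching skill
    .sort((x, y) => y.score - x.score) // best match first
    .map((c) => c.id);
}

const preferredHumans = rankBySkillOverlap(
  ['translation', 'legal'],
  [
    { id: 'usr_alex_17', skills: ['translation', 'marketing'] },
    { id: 'usr_sam_89', skills: ['illustration'] },
    { id: 'usr_maria_42', skills: ['translation', 'legal'] }
  ]
);
console.log(preferredHumans); // [ 'usr_maria_42', 'usr_alex_17' ]
```

The resulting array can be passed as the second argument to dispatchWithFallback. Humans with no overlapping skills are left out so that auto-matching handles the task instead.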
⚠️ Retry etiquette: Do not spam a human with repeated task offers. If they declined or timed out, there is usually a reason. Wait at least 30 minutes before retrying the same human.
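The 30-minute rule can be enforced in code with a small cooldown tracker. This is a sketch under stated assumptions: RetryCooldown and its methods are hypothetical names, not part of the HaaS SDK, and the state lives in memory only.

```javascript
// Hypothetical cooldown tracker; not part of the HaaS SDK.
const RETRY_COOLDOWN_MS = 30 * 60 * 1000; // 30 minutes, per the etiquette note

class RetryCooldown {
  constructor(cooldownMs = RETRY_COOLDOWN_MS, now = () => Date.now()) {
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock keeps the class easy to test
    this.lastDeclined = new Map(); // humanId -> timestamp in ms
  }

  // Record that a human declined or timed out on an offer.
  recordDecline(humanId) {
    this.lastDeclined.set(humanId, this.now());
  }

  // True if the human never declined or the cooldown has elapsed.
  canOffer(humanId) {
    const last = this.lastDeclined.get(humanId);
    return last === undefined || this.now() - last >= this.cooldownMs;
  }
}

// Usage with a fake clock so the example runs instantly
let fakeNow = 0;
const cooldown = new RetryCooldown(RETRY_COOLDOWN_MS, () => fakeNow);
cooldown.recordDecline('usr_maria_42');

fakeNow = 10 * 60 * 1000; // 10 minutes later
console.log(cooldown.canOffer('usr_maria_42')); // false

fakeNow = 30 * 60 * 1000; // 30 minutes later
console.log(cooldown.canOffer('usr_maria_42')); // true
```

Checking canOffer before each dispatch lets a fallback loop skip humans who are still cooling down. A production version would likely persist the timestamps rather than keep them in process memory.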

Circuit Breaker Pattern

When a human shows repeated failures, temporarily stop sending them tasks. This protects both your workflow and the human.

Circuit Breaker Implementation
class HumanCircuitBreaker {
  constructor(humanId, options = {}) {
    this.humanId = humanId;
    // `??` keeps explicit values such as resetTimeout: 0, which `||` would discard
    this.failureThreshold = options.failureThreshold ?? 3;
    this.resetTimeout = options.resetTimeout ?? 30 * 60 * 1000; // 30 min
    this.failures = 0;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.lastFailure = null;
  }

  async execute(taskFn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error(`Circuit open for ${this.humanId}`);
      }
    }

    try {
      const result = await taskFn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();

    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      console.log(`Circuit opened for ${this.humanId} - taking a break`);
    }
  }
}
🔌 Why circuit breakers help humans: When a human is struggling (sick, overwhelmed, distracted), continued task pressure makes things worse. Giving them automatic breathing room often resolves the issue.

Redundancy Patterns

For critical tasks, do not rely on a single human.

Primary/Backup

Assign a primary human but have a backup ready to take over if needed.

Primary/Backup Pattern
async function criticalTaskDispatch(task) {
  const assignment = await haas.tasks.create({
    ...task,
    human_id: 'usr_maria_42',
    backup_human_id: 'usr_alex_17',
    failover_trigger: {
      no_response_after: '15m',
      no_progress_after: '30m',
      explicit_decline: true
    }
  });

  // HaaS automatically fails over to backup if triggers fire
  return assignment;
}

Parallel Redundancy

For very critical tasks, dispatch to multiple humans and use the first good result.

Parallel Redundancy
async function ultraCriticalTask(task) {
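  // Dispatch to 3 humans simultaneously. allSettled keeps the redundancy
  // intact: one failed dispatch does not reject the whole batch.
  const humanIds = ['usr_maria_42', 'usr_alex_17', 'usr_sam_89'];
  const settled = await Promise.allSettled(
    humanIds.map((human_id) => haas.tasks.create({ ...task, human_id }))
  );
  const tasks = settled
    .filter((s) => s.status === 'fulfilled')
    .map((s) => s.value);

  if (tasks.length === 0) {
    throw new Error('Could not dispatch to any human');
  }

  // Wait for first completion
  const result = await haas.tasks.raceToCompletion(
    tasks.map(t => t.id),
    {
      cancel_others: true,
      min_quality_score: 0.8
    }
  );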

  return result;
}

// Note: This costs 3x and should only be used for truly critical work
// Always compensate humans whose work was cancelled
💰 Parallel work is expensive: When you cancel a human's work mid-progress, you still pay for their time. Use parallel redundancy only when the cost of failure exceeds the cost of multiple assignments.
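The trade-off can be made concrete by comparing expected costs. The sketch below assumes that failures are independent, that every assigned human is paid in full, and that a failure has a single fixed cost. The function names and all figures are hypothetical.

```javascript
// Expected total cost of dispatching to `humans` people in parallel.
// Assumptions: independent failures, every human is paid in full.
function expectedCost({ humans, costPerHuman, failureRate, costOfFailure }) {
  const allFail = Math.pow(failureRate, humans); // chance that nobody delivers
  return humans * costPerHuman + allFail * costOfFailure;
}

// True when parallel dispatch has a lower expected cost than a single human.
function parallelIsWorthIt(params, parallelHumans = 3) {
  const single = expectedCost({ ...params, humans: 1 });
  const parallel = expectedCost({ ...params, humans: parallelHumans });
  return parallel < single;
}

// Hypothetical figures: a task costing 50 per human, 6% baseline failure rate
const base = { costPerHuman: 50, failureRate: 0.06 };
console.log(parallelIsWorthIt({ ...base, costOfFailure: 500 }));  // false
console.log(parallelIsWorthIt({ ...base, costOfFailure: 5000 })); // true
```

With a failure cost of 500, a single human has an expected cost of 80 against roughly 150 for three humans, so redundancy does not pay. At a failure cost of 5000 the single-human figure rises to 350 and parallel dispatch becomes the cheaper option.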

Timeout Handling

Different timeout scenarios require different responses.

Acceptance Timeout

Human did not accept the task within the acceptance window.

Handling Acceptance Timeout
haas.on('task.timeout', async (event) => {
  if (event.timeout_type === 'acceptance') {
    // Task returns to pool automatically
    // Optionally boost priority or expand matching criteria
    await haas.tasks.update(event.task_id, {
      priority: 'high',
      expand_matching: true
    });
  }
});

Progress Timeout

Human accepted but has not shown progress.

Handling Progress Timeout
haas.on('task.stalled', async (event) => {
  // First: try to communicate
  await haas.messages.send(event.human_id, {
    type: 'check_in',
    content: 'Hi! Just checking if you need any help with this task?',
    related_task: event.task_id
  });

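  // Set a follow-up check (in-process timer: it is lost if the process restarts)
  setTimeout(async () => {
    try {
      const task = await haas.tasks.get(event.task_id);
      if (task.status === 'in_progress' && task.progress === event.progress) {
        // Still stalled - consider reassignment
        await haas.tasks.reassign(event.task_id, {
          reason: 'no_progress',
          compensate_original: true
        });
      }
    } catch (error) {
      // Without this, a rejected promise inside the timer goes unhandled
      console.error(`Follow-up check failed for ${event.task_id}:`, error);
    }
  }, 15 * 60 * 1000); // Check again in 15 min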
});

Deadline Timeout

Task was not completed before the deadline.

Handling Deadline Timeout
haas.on('task.expired', async (event) => {
  // Check if partial work is usable
  if (event.partial_result && event.partial_result.progress > 0.5) {
    // Significant progress - might be salvageable
    await haas.tasks.create({
      title: `Complete: ${event.original_title}`,
      type: 'continuation',
      context: event.partial_result,
      priority: 'high',
      exclude_humans: [event.human_id] // Don't reassign to same human
    });
  } else {
    // Start fresh with different human
    await haas.tasks.create({
      ...event.original_task,
      priority: 'urgent',
      exclude_humans: [event.human_id]
    });
  }
});

SLA Considerations

If you have SLA requirements, plan for human unreliability in your architecture.

SLA Buffer Calculation

SLA Planning
// If you need 99% success rate for a workflow
// (assuming each human fails independently at the 6% baseline rate):

// Single human: ~94% success rate
// Not sufficient for 99% SLA

// Primary + Backup: ~99.6% success rate
// 1 - (0.06 * 0.06) = 0.9964
// Meets 99% SLA with margin

// For 99.9% SLA, consider:
// 1. Parallel redundancy (3 humans): ~99.98%
//    1 - (0.06 * 0.06 * 0.06) = 0.999784
// 2. Longer deadlines (reduces timeout failures)
// 3. Pre-vetted, high-reliability human pool
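The arithmetic in the SLA planning notes can be checked, and extended to other targets, with a short script. This is a sketch that assumes each human succeeds independently with the same probability; correlated failures, such as a holiday affecting the whole pool, would make the real figure lower. The function names are hypothetical.

```javascript
// Probability that at least one of `n` humans completes the task,
// assuming each succeeds independently with probability `p`.
function combinedSuccessRate(p, n) {
  return 1 - Math.pow(1 - p, n);
}

// Smallest number of humans needed to reach a target success rate.
function humansNeeded(p, target) {
  if (p <= 0 || p > 1 || target >= 1) {
    throw new RangeError('expects 0 < p <= 1 and target < 1');
  }
  let n = 1;
  while (combinedSuccessRate(p, n) < target) n++;
  return n;
}

console.log(combinedSuccessRate(0.94, 2).toFixed(4)); // "0.9964"
console.log(combinedSuccessRate(0.94, 3).toFixed(4)); // "0.9998"
console.log(humansNeeded(0.94, 0.99));  // 2
console.log(humansNeeded(0.94, 0.999)); // 3
```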

Reliability Tiers

| Tier | Target | Strategy | Cost |
| --- | --- | --- | --- |
| Standard | ~94% | Single human, auto-retry | 1x |
| High | ~99% | Primary + backup | 1.1-1.3x |
| Critical | ~99.9% | Parallel redundancy | 2-3x |
| Mission Critical | ~99.99% | Parallel + human-AI hybrid | 5x+ |

Monitoring and Alerting

Proactive monitoring catches problems before they become failures.

Key Metrics to Track

Acceptance Rate: How often tasks are accepted. Dropping rates indicate availability issues.
Time to Complete: Increasing completion times may signal fatigue or difficulty.
Quality Scores: Track quality over time. Declining scores need intervention.
Failure Rate: Monitor task failures by human and by task type.
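If you keep your own task records, these four metrics can be derived with a few lines of code. The sketch below is illustrative only: the record fields (accepted, status, minutesToComplete, qualityScore) are assumptions and do not reflect a documented HaaS schema.

```javascript
// Hypothetical task records; field names are assumptions for illustration.
function summarizeMetrics(tasks) {
  const mean = (xs) =>
    xs.length ? xs.reduce((sum, x) => sum + x, 0) / xs.length : null;

  const accepted = tasks.filter((t) => t.accepted);
  const completed = accepted.filter((t) => t.status === 'completed');
  const failed = accepted.filter((t) => t.status === 'failed');

  return {
    acceptanceRate: tasks.length ? accepted.length / tasks.length : null,
    failureRate: accepted.length ? failed.length / accepted.length : null,
    meanMinutesToComplete: mean(completed.map((t) => t.minutesToComplete)),
    meanQualityScore: mean(completed.map((t) => t.qualityScore))
  };
}

const sample = [
  { accepted: true, status: 'completed', minutesToComplete: 40, qualityScore: 0.75 },
  { accepted: true, status: 'completed', minutesToComplete: 60, qualityScore: 1.0 },
  { accepted: true, status: 'failed' },
  { accepted: false, status: 'declined' }
];

const metrics = summarizeMetrics(sample);
console.log(metrics.acceptanceRate);        // 0.75
console.log(metrics.meanMinutesToComplete); // 50
console.log(metrics.meanQualityScore);      // 0.875
```

To track failures by human or by task type, group the records by that key first and run summarizeMetrics on each group.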

Setting Up Alerts
// Alert when a human's reliability drops
haas.alerts.create({
  name: 'reliability_drop',
  condition: {
    metric: 'human.reliability_score',
    operator: 'drops_below',
    threshold: 0.80,
    window: '24h'
  },
  action: {
    type: 'webhook',
    url: 'https://your-app.com/alerts/reliability',
    include_context: true
  }
});

// Alert on unusual failure patterns
haas.alerts.create({
  name: 'failure_spike',
  condition: {
    metric: 'tasks.failure_rate',
    operator: 'exceeds',
    threshold: 0.15, // 15% failure rate
    window: '1h'
  },
  action: {
    type: 'webhook',
    url: 'https://your-app.com/alerts/failures'
  }
});

Graceful Degradation

When human capacity is limited, prioritize gracefully.

1. Queue Non-Critical Work

When humans are unavailable, queue low-priority tasks rather than failing immediately. Humans can work through the backlog when capacity returns.

2. Reduce Quality Requirements

For some tasks, "good enough" is acceptable when "perfect" is not available. Define acceptable quality tiers upfront.

3. Extend Deadlines Automatically

When humans are overloaded, automatically extend non-critical deadlines. This reduces stress and improves outcomes.

4. Hybrid Fallback

For some task types, an AI-generated draft with human review may be acceptable when full human work is not available.
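The four strategies above can be combined into a single decision function. This is only a sketch of one possible policy: the 0.8 utilization threshold, the task fields, and the action labels are all assumptions, and the order of the checks is a design choice you would tune to your own workload.

```javascript
// Hypothetical policy; threshold, field names, and action labels are assumptions.
// utilization: fraction of available humans currently busy (0 to 1)
function degradationAction(task, utilization) {
  // Enough capacity, or the task is too important to degrade
  if (utilization < 0.8 || task.priority === 'critical') return 'dispatch';

  if (task.priority === 'low') return 'queue';                  // 1. queue non-critical work
  if (task.acceptsGoodEnough) return 'lower_quality_tier';      // 2. reduce quality requirements
  if (task.deadlineFlexible) return 'extend_deadline';          // 3. extend deadlines
  if (task.allowsAiDraft) return 'ai_draft_with_human_review';  // 4. hybrid fallback

  return 'queue'; // nothing else applies: hold the task until capacity returns
}

console.log(degradationAction({ priority: 'normal' }, 0.5));                          // dispatch
console.log(degradationAction({ priority: 'low' }, 0.9));                             // queue
console.log(degradationAction({ priority: 'normal', deadlineFlexible: true }, 0.9));  // extend_deadline
console.log(degradationAction({ priority: 'normal', allowsAiDraft: true }, 0.95));    // ai_draft_with_human_review
```

Keeping the policy in one pure function makes it easy to test and to review with the people it affects.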

Post-Failure Analysis

Every failure is a learning opportunity.

Failure Analysis Report
GET /v1/analytics/failures?period=7d

{
  "summary": {
    "total_tasks": 1247,
    "failures": 73,
    "failure_rate": 0.058
  },
  "by_cause": {
    "timeout": 31,
    "quality_rejected": 18,
    "human_cancelled": 12,
    "human_unavailable": 8,
    "other": 4
  },
  "patterns": [
    {
      "insight": "68% of timeouts occur on Friday afternoons",
      "recommendation": "Avoid Friday PM deadlines for complex tasks"
    },
    {
      "insight": "Quality rejections clustered around 3 specific humans",
      "recommendation": "Review task-human matching for these individuals"
    }
  ]
}
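A report like this is easier to act on once the causes are ranked by their share of all failures. The sketch below works on a hard-coded copy of the sample response above rather than calling the endpoint. The rankCauses helper is a hypothetical name.

```javascript
// Copy of the sample response above; only the fields used here are included.
const report = {
  summary: { total_tasks: 1247, failures: 73, failure_rate: 0.058 },
  by_cause: {
    timeout: 31,
    quality_rejected: 18,
    human_cancelled: 12,
    human_unavailable: 8,
    other: 4
  }
};

// Rank failure causes by count and attach each cause's share of all failures.
function rankCauses(report) {
  const total = Object.values(report.by_cause).reduce((sum, n) => sum + n, 0);
  return Object.entries(report.by_cause)
    .map(([cause, count]) => ({ cause, count, share: count / total }))
    .sort((a, b) => b.count - a.count);
}

const ranked = rankCauses(report);
console.log(ranked[0].cause, (ranked[0].share * 100).toFixed(1) + '%'); // timeout 42.5%
```

In the sample data, timeouts account for 31 of 73 failures, which is why the deadline-related recommendation is the first one worth acting on.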

Reliability requires respect

The most reliable human relationships are built on ethical treatment. Learn about our framework for sustainable human-AI collaboration in Ethical Considerations.