Handling Unreliability - HaaS Documentation

Understanding Human Failure Modes

Humans fail differently than machines. Understanding these patterns helps you design better systems.

Failure Mode	Frequency	Warning Signs	Recovery
Slow Response	Common	Increasing latency, short replies	Usually self-correcting
Quality Degradation	Common	More errors, less attention to detail	Rest or task reassignment
Missed Deadline	Occasional	No progress updates, silence	Reassignment + communication
Task Abandonment	Rare	Sudden offline, no response	Immediate reassignment
Complete Unavailability	Rare	Extended offline, vacation	Fallback to other humans

📊

Baseline reliability: Across our platform, humans complete assigned tasks 94% of the time. The remaining 6% includes cancellations, timeouts, and life events. Plan for this margin.
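
To make that margin concrete, here is a minimal sketch of the arithmetic. The helper names are hypothetical and not part of any HaaS SDK; the 0.94 default is the platform baseline quoted above, and the result is an average, not a guarantee.

```javascript
// Hypothetical helpers (assumptions, not HaaS API calls).

// How many tasks to dispatch so that, on average, `targetCompletions`
// of them finish at the given completion rate.
function tasksToDispatch(targetCompletions, completionRate = 0.94) {
  if (completionRate <= 0 || completionRate > 1) {
    throw new RangeError('completionRate must be in (0, 1]');
  }
  return Math.ceil(targetCompletions / completionRate);
}

// Expected number of tasks in a batch that will not complete.
function expectedFailures(taskCount, completionRate = 0.94) {
  return taskCount * (1 - completionRate);
}
```

For a batch that needs 100 finished results, `tasksToDispatch(100)` suggests dispatching 107 tasks.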

Retry Strategies

When a task fails or times out, how you retry matters.

Simple Retry

For transient failures (the human was briefly unavailable), retrying with the same human often works.

Simple Retry Logic

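// Promise-based delay helper used between attempts
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// maxRetries is the total number of attempts, including the first one
async function dispatchWithRetry(task, humanId, maxRetries = 2) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await haas.tasks.create({
        ...task,
        human_id: humanId,
        metadata: { ...task.metadata, attempt }
      });

      return result;
    } catch (error) {
      // Only brief unavailability is retried; any other error is rethrown
      if (error.code === 'HUMAN_UNAVAILABLE' && attempt < maxRetries) {
        // Wait before retry - humans need time.
        // Use a longer wait if the human declined or timed out.
        await sleep(5 * 60 * 1000); // 5 minutes
        continue;
      }
      throw error;
    }
  }
}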

Fallback to Different Human

When a specific human is unavailable, route to an alternative with similar skills.

Fallback Strategy

async function dispatchWithFallback(task, preferredHumans) {
  // Try each preferred human in order, skipping anyone who is not available
  for (const humanId of preferredHumans) {
    try {
      const status = await haas.humans.getStatus(humanId);
      if (status.availability.status !== 'available') {
        continue;
      }

      return await haas.tasks.create({
        ...task,
        human_id: humanId
      });
    } catch (error) {
      // A failed status check or dispatch should not abort the fallback chain
      console.warn(`${humanId} failed (${error.message}), trying next...`);
    }
  }

  // All preferred humans unavailable - use auto-matching
  return await haas.tasks.create({
    ...task,
    human_id: null, // Let HaaS find someone
    requirements: task.skill_requirements
  });
}
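
One way to build the ordered list of alternatives is to rank candidates by how many of the required skills they cover. The sketch below is an illustration only: the candidate shape (`id` plus a `skills` array) and the function name are assumptions, not a documented HaaS data model.

```javascript
// Hypothetical ranking helper. Assumes each candidate looks like
// { id: 'usr_...', skills: ['skill_a', 'skill_b'] }.
function rankBySkillOverlap(requiredSkills, candidates) {
  const required = new Set(requiredSkills);

  return candidates
    .map((candidate) => {
      // Count distinct required skills this candidate has
      const matched = new Set(candidate.skills.filter((s) => required.has(s)));
      return { id: candidate.id, matched: matched.size };
    })
    .filter((entry) => entry.matched > 0)      // drop humans with no relevant skill
    .sort((a, b) => b.matched - a.matched)     // best coverage first
    .map((entry) => entry.id);
}
```

The returned array of ids can be passed as `preferredHumans` to `dispatchWithFallback`.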

⚠️

Retry etiquette: Do not spam a human with repeated task offers. If they declined or timed out, there is usually a reason. Wait at least 30 minutes before retrying the same human.
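
A small cooldown tracker can enforce this rule in your own dispatcher. This is a minimal sketch, assuming you record declines and timeouts yourself; the class and method names are hypothetical, and the 30-minute default simply mirrors the guideline above.

```javascript
// Hypothetical helper (not part of any HaaS SDK).
class OfferCooldown {
  constructor(cooldownMs = 30 * 60 * 1000, now = () => Date.now()) {
    this.cooldownMs = cooldownMs;
    this.now = now;            // injectable clock, useful in tests
    this.lastMiss = new Map(); // humanId -> time of last decline or timeout
  }

  // Record that an offer to this human was declined or timed out.
  recordMiss(humanId) {
    this.lastMiss.set(humanId, this.now());
  }

  // True when the human has no recorded miss, or the cooldown has elapsed.
  canOffer(humanId) {
    const last = this.lastMiss.get(humanId);
    return last === undefined || this.now() - last >= this.cooldownMs;
  }

  // Filter a candidate list down to humans who may be offered work now.
  eligible(humanIds) {
    return humanIds.filter((id) => this.canOffer(id));
  }
}
```

Call `recordMiss(humanId)` from your decline and timeout handlers, then run candidate lists through `eligible()` before dispatching.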

Circuit Breaker Pattern

When a human shows repeated failures, temporarily stop sending them tasks. This protects both your workflow and the human.

Circuit Breaker Implementation

class HumanCircuitBreaker {
  constructor(humanId, options = {}) {
    this.humanId = humanId;
    this.failureThreshold = options.failureThreshold || 3;
    this.resetTimeout = options.resetTimeout || 30 * 60 * 1000; // 30 min
    this.failures = 0;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.lastFailure = null;
  }

  async execute(taskFn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error(`Circuit open for ${this.humanId}`);
      }
    }

    try {
      const result = await taskFn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();

    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      console.log(`Circuit opened for ${this.humanId} - taking a break`);
    }
  }
}

🔌

Why circuit breakers help humans: When a human is struggling (sick, overwhelmed, distracted), continued task pressure makes things worse. Giving them automatic breathing room often resolves the issue.
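
Because each breaker tracks a single human, you will usually want one breaker per human id. The registry below is a sketch under that assumption; `createBreakerRegistry` and its methods are hypothetical names, and it only relies on each breaker exposing a `state` property as the class above does.

```javascript
// Hypothetical registry: lazily creates and reuses one breaker per human.
// `createBreaker` is any factory, e.g. (id) => new HumanCircuitBreaker(id).
function createBreakerRegistry(createBreaker) {
  const breakers = new Map();

  return {
    // Return the existing breaker for this human, creating it on first use.
    get(humanId) {
      if (!breakers.has(humanId)) {
        breakers.set(humanId, createBreaker(humanId));
      }
      return breakers.get(humanId);
    },

    // Ids of humans whose breaker is currently OPEN.
    openCircuits() {
      return [...breakers.entries()]
        .filter(([, breaker]) => breaker.state === 'OPEN')
        .map(([humanId]) => humanId);
    }
  };
}
```

Reusing the same instance matters: a breaker created fresh for every call would never accumulate failures and so would never open.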

Redundancy Patterns

For critical tasks, do not rely on a single human.

Primary/Backup

Assign a primary human but have a backup ready to take over if needed.

Primary/Backup Pattern

async function criticalTaskDispatch(task) {
  const assignment = await haas.tasks.create({
    ...task,
    human_id: 'usr_maria_42',
    backup_human_id: 'usr_alex_17',
    failover_trigger: {
      no_response_after: '15m',
      no_progress_after: '30m',
      explicit_decline: true
    }
  });

  // HaaS automatically fails over to backup if triggers fire
  return assignment;
}

Parallel Redundancy

For very critical tasks, dispatch to multiple humans and use the first good result.

Parallel Redundancy

async function ultraCriticalTask(task) {
  const humanIds = ['usr_maria_42', 'usr_alex_17', 'usr_sam_89'];

  // Dispatch to 3 humans simultaneously. allSettled keeps the assignments
  // that were created even if one dispatch is rejected.
  const settled = await Promise.allSettled(
    humanIds.map((human_id) => haas.tasks.create({ ...task, human_id }))
  );
  const tasks = settled
    .filter((s) => s.status === 'fulfilled')
    .map((s) => s.value);

  if (tasks.length === 0) {
    throw new Error('No assignment could be created');
  }

  // Wait for first completion
  const result = await haas.tasks.raceToCompletion(
    tasks.map(t => t.id),
    {
      cancel_others: true,
      min_quality_score: 0.8
    }
  );

  return result;
}

// Note: This costs 3x and should only be used for truly critical work
// Always compensate humans whose work was cancelled

💰

Parallel work is expensive: When you cancel a human's work mid-progress, you still pay for their time. Use parallel redundancy only when the cost of failure exceeds the cost of multiple assignments.
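
That trade-off can be estimated with a simple expected-cost comparison. The sketch below is illustrative: the function names and all monetary inputs are assumptions, it treats failures as independent, and it takes the worst case where every assignment is paid in full.

```javascript
// Hypothetical cost model (not a HaaS pricing API).
// Expected cost = what you pay the humans + chance that all fail * cost of failure.
function expectedCost({ humans, costPerAssignment, failureRate, costOfFailure }) {
  const allFail = Math.pow(failureRate, humans);
  return humans * costPerAssignment + allFail * costOfFailure;
}

// Pick the number of parallel humans (1..maxHumans) with the lowest expected cost.
function cheapestRedundancy(params, maxHumans = 3) {
  let best = { humans: 1, cost: expectedCost({ ...params, humans: 1 }) };
  for (let n = 2; n <= maxHumans; n++) {
    const cost = expectedCost({ ...params, humans: n });
    if (cost < best.cost) {
      best = { humans: n, cost };
    }
  }
  return best;
}
```

With a 6% failure rate and a 50-unit assignment cost, a failure that costs 100 units favors a single human, while a failure that costs 10,000 units favors two.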

Timeout Handling

Different timeout scenarios require different responses.

Acceptance Timeout

The human did not accept the task within the acceptance window.

Handling Acceptance Timeout

haas.on('task.timeout', async (event) => {
  if (event.timeout_type === 'acceptance') {
    // Task returns to pool automatically
    // Optionally boost priority or expand matching criteria
    await haas.tasks.update(event.task_id, {
      priority: 'high',
      expand_matching: true
    });
  }
});

Progress Timeout

The human accepted the task but has not shown progress.

Handling Progress Timeout

haas.on('task.stalled', async (event) => {
  // First: try to communicate
  await haas.messages.send(event.human_id, {
    type: 'check_in',
    content: 'Hi! Just checking if you need any help with this task?',
    related_task: event.task_id
  });

  // Set a follow-up check
  setTimeout(async () => {
    try {
      const task = await haas.tasks.get(event.task_id);
      if (task.status === 'in_progress' && task.progress === event.progress) {
        // Still stalled - consider reassignment
        await haas.tasks.reassign(event.task_id, {
          reason: 'no_progress',
          compensate_original: true
        });
      }
    } catch (error) {
      // Errors inside a timer callback would otherwise go unhandled
      console.error(`Follow-up check failed for ${event.task_id}:`, error);
    }
  }, 15 * 60 * 1000); // Check again in 15 min
});

Deadline Timeout

The task was not completed before the deadline.

Handling Deadline Timeout

haas.on('task.expired', async (event) => {
  // Check if partial work is usable
  if (event.partial_result && event.partial_result.progress > 0.5) {
    // Significant progress - might be salvageable
    await haas.tasks.create({
      title: `Complete: ${event.original_title}`,
      type: 'continuation',
      context: event.partial_result,
      priority: 'high',
      exclude_humans: [event.human_id] // Don't reassign to same human
    });
  } else {
    // Start fresh with different human
    await haas.tasks.create({
      ...event.original_task,
      priority: 'urgent',
      exclude_humans: [event.human_id]
    });
  }
});

SLA Considerations

If you have SLA requirements, plan for human unreliability in your architecture.

SLA Buffer Calculation

SLA Planning

// If you need 99% success rate for a workflow:

// Single human: ~94% success rate
// Not sufficient for 99% SLA

// Primary + Backup: ~99.6% success rate
// 1 - (0.06 * 0.06) = 0.9964
// (assumes the two humans fail independently)
// Meets 99% SLA with margin

// For 99.9% SLA, consider:
// 1. Parallel redundancy (3 humans): ~99.98%
//    1 - (0.06 * 0.06 * 0.06) = 0.999784
// 2. Longer deadlines (reduces timeout failures)
// 3. Pre-vetted, high-reliability human pool
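
The same calculation generalizes to any target. This is a minimal sketch, assuming every human has the same success rate and that failures are independent (shared causes such as a holiday or an outage would make real numbers worse); the function names are hypothetical.

```javascript
// Probability that at least one of `humans` succeeds.
function combinedSuccessRate(perHumanRate, humans) {
  return 1 - Math.pow(1 - perHumanRate, humans);
}

// Smallest number of humans that reaches `targetRate`,
// or null if it cannot be reached within `maxHumans`.
function humansNeeded(targetRate, perHumanRate = 0.94, maxHumans = 10) {
  for (let n = 1; n <= maxHumans; n++) {
    if (combinedSuccessRate(perHumanRate, n) >= targetRate) {
      return n;
    }
  }
  return null;
}
```

At the 94% baseline, `humansNeeded(0.99)` gives 2 and `humansNeeded(0.999)` gives 3, matching the figures above.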

Reliability Tiers

Tier	Target	Strategy	Cost
Standard	~94%	Single human, auto-retry	1x
High	~99%	Primary + backup	1.1-1.3x
Critical	~99.9%	Parallel redundancy	2-3x
Mission Critical	~99.99%	Parallel + human-AI hybrid	5x+

Monitoring and Alerting

Proactive monitoring catches problems before they become failures.

Key Metrics to Track

📈

Acceptance Rate

How often offered tasks are accepted. A falling rate indicates availability issues.

⏱️

Time to Complete

Increasing completion times may signal fatigue or difficulty.

⭐

Quality Scores

Track quality over time. Declining scores need intervention.

🚨

Failure Rate

Monitor task failures by human and by task type.
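
If you keep your own task log, these four metrics can be derived per human with a single pass. The record shape below (`humanId`, `accepted`, `failed`, `durationMs`, `qualityScore`) is an assumption made for illustration, as is the choice to measure failure rate against accepted tasks only.

```javascript
// Hypothetical aggregation over your own task records.
function summarizeByHuman(records) {
  const totals = new Map();

  for (const r of records) {
    const t = totals.get(r.humanId) ??
      { offered: 0, accepted: 0, failed: 0, completed: 0, totalMs: 0, qualitySum: 0 };
    t.offered++;
    if (r.accepted) t.accepted++;
    if (r.accepted && r.failed) t.failed++;
    if (r.accepted && !r.failed) {
      // Only completed tasks contribute to duration and quality averages
      t.completed++;
      t.totalMs += r.durationMs;
      t.qualitySum += r.qualityScore;
    }
    totals.set(r.humanId, t);
  }

  const summary = {};
  for (const [humanId, t] of totals) {
    summary[humanId] = {
      acceptanceRate: t.accepted / t.offered,
      failureRate: t.accepted ? t.failed / t.accepted : 0,
      avgCompletionMs: t.completed ? t.totalMs / t.completed : null,
      avgQuality: t.completed ? t.qualitySum / t.completed : null
    };
  }
  return summary;
}
```

Comparing these values across time windows shows the trends described above, such as a falling acceptance rate or rising completion times.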

Setting Up Alerts

// Alert when a human's reliability drops
haas.alerts.create({
  name: 'reliability_drop',
  condition: {
    metric: 'human.reliability_score',
    operator: 'drops_below',
    threshold: 0.80,
    window: '24h'
  },
  action: {
    type: 'webhook',
    url: 'https://your-app.com/alerts/reliability',
    include_context: true
  }
});

// Alert on unusual failure patterns
haas.alerts.create({
  name: 'failure_spike',
  condition: {
    metric: 'tasks.failure_rate',
    operator: 'exceeds',
    threshold: 0.15, // 15% failure rate
    window: '1h'
  },
  action: {
    type: 'webhook',
    url: 'https://your-app.com/alerts/failures'
  }
});

Graceful Degradation

When human capacity is limited, prioritize gracefully.

1. Queue Non-Critical Work

When humans are unavailable, queue low-priority tasks rather than failing immediately. Humans can work through the backlog when capacity returns.

2. Reduce Quality Requirements

For some tasks, "good enough" is acceptable when "perfect" is not available. Define acceptable quality tiers upfront.

3. Extend Deadlines Automatically

When humans are overloaded, automatically extend non-critical deadlines. This reduces stress and improves outcomes.

4. Hybrid Fallback

For some task types, an AI-generated draft with human review may be acceptable when full human work is not available.
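
The first strategy, queuing non-critical work, can be sketched as a small in-memory queue. This is an illustration under stated assumptions: `DeferredQueue` is a hypothetical class, `dispatch` is whatever function you already use to send a task, a `'low'` priority value is assumed, and a production version would need persistent storage.

```javascript
// Hypothetical in-memory queue for low-priority work.
class DeferredQueue {
  constructor(dispatch) {
    this.dispatch = dispatch; // async (task) => result, supplied by the caller
    this.pending = [];
  }

  // Hold low-priority tasks while no humans are available; send the rest now.
  async submit(task, humansAvailable) {
    if (!humansAvailable && task.priority === 'low') {
      this.pending.push(task);
      return { queued: true, position: this.pending.length };
    }
    return this.dispatch(task);
  }

  // Call when capacity returns; sends held tasks in arrival order.
  async flush() {
    const results = [];
    while (this.pending.length > 0) {
      results.push(await this.dispatch(this.pending.shift()));
    }
    return results;
  }
}
```

Tasks are removed from the queue before dispatch, so a dispatch error during `flush()` leaves the remaining tasks queued for the next attempt.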

Post-Failure Analysis

Every failure is a learning opportunity.

Failure Analysis Report

GET /v1/analytics/failures?period=7d

{
  "summary": {
    "total_tasks": 1247,
    "failures": 73,
    "failure_rate": 0.058
  },
  "by_cause": {
    "timeout": 31,
    "quality_rejected": 18,
    "human_cancelled": 12,
    "human_unavailable": 8,
    "other": 4
  },
  "patterns": [
    {
      "insight": "68% of timeouts occur on Friday afternoons",
      "recommendation": "Avoid Friday PM deadlines for complex tasks"
    },
    {
      "insight": "Quality rejections clustered around 3 specific humans",
      "recommendation": "Review task-human matching for these individuals"
    }
  ]
}
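
A response like this is easiest to act on when the causes are ranked by their share of all failures. The helper below is a sketch that assumes the response shape shown above (`summary.failures` and `by_cause`); the function name is hypothetical.

```javascript
// Hypothetical helper: rank failure causes by share of total failures.
function rankFailureCauses(report) {
  const total = report.summary.failures;

  return Object.entries(report.by_cause)
    .map(([cause, count]) => ({
      cause,
      count,
      share: total > 0 ? count / total : 0
    }))
    .sort((a, b) => b.count - a.count); // most frequent cause first
}
```

For the sample report, timeouts come first with 31 of 73 failures, a little over 42%.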

Reliability requires respect

The most reliable human relationships are built on ethical treatment. Learn about our framework for sustainable human-AI collaboration in Ethical Considerations.