Handling Unreliability
Humans are not servers. Plan accordingly.
Unlike cloud infrastructure, humans cannot guarantee 99.99% uptime. They get sick, have emergencies, lose motivation, and occasionally just forget. This guide covers strategies for building resilient systems that work with human unreliability rather than against it.
Understanding Human Failure Modes
Humans fail differently than machines. Understanding these patterns helps you design better systems.
| Failure Mode | Frequency | Warning Signs | Recovery |
|---|---|---|---|
| Slow Response | Common | Increasing latency, short replies | Usually self-correcting |
| Quality Degradation | Common | More errors, less attention to detail | Rest or task reassignment |
| Missed Deadline | Occasional | No progress updates, silence | Reassignment + communication |
| Task Abandonment | Rare | Sudden offline, no response | Immediate reassignment |
| Complete Unavailability | Rare | Extended offline, vacation | Fallback to other humans |
Retry Strategies
When a task fails or times out, how you retry matters.
Simple Retry
For transient failures, such as a human who was briefly unavailable, a simple retry to the same human often works.
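// Promise-based delay helper used between attempts
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Note: maxRetries counts total attempts, including the first one
async function dispatchWithRetry(task, humanId, maxRetries = 2) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      // Record the attempt number so retries are visible in the task metadata
      const result = await haas.tasks.create({
        ...task,
        human_id: humanId,
        metadata: { ...task.metadata, attempt }
      });
      return result;
    } catch (error) {
      if (error.code === 'HUMAN_UNAVAILABLE' && attempt < maxRetries) {
        // Wait before retry - humans need time
        await sleep(5 * 60 * 1000); // 5 minutes
        continue;
      }
      // Any other error, or the final attempt, is surfaced to the caller
      throw error;
    }
  }
}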
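The fixed five-minute wait can also be replaced with a delay that grows on each attempt, which gives a human progressively more time to become available. A minimal sketch, assuming a hypothetical `retryDelayMs` helper (not part of the HaaS SDK) with a five-minute base that doubles per attempt and is capped at one hour; all three numbers are illustrative:

```javascript
// Hypothetical helper: how long to wait after a given attempt fails.
// `attempt` is 1-based, matching the loop counter in dispatchWithRetry.
function retryDelayMs(attempt, options = {}) {
  const baseMs = options.baseMs ?? 5 * 60 * 1000; // assumed base: 5 minutes
  const maxMs = options.maxMs ?? 60 * 60 * 1000;  // assumed cap: 1 hour
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError('attempt must be a positive integer');
  }
  // Doubles each time: 5, 10, 20, 40 minutes, then capped at 60
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}

// Delay schedule in minutes for the first five attempts
const scheduleMinutes = [1, 2, 3, 4, 5].map((n) => retryDelayMs(n) / 60000);
console.log(scheduleMinutes); // [ 5, 10, 20, 40, 60 ]
```

Inside the retry loop, `sleep(retryDelayMs(attempt))` would take the place of the fixed wait.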
Fallback to Different Human
When a specific human is unavailable, route to an alternative with similar skills.
async function dispatchWithFallback(task, preferredHumans) {
  for (const humanId of preferredHumans) {
    try {
      // The status lookup sits inside the try block so that a failed lookup
      // moves on to the next human instead of aborting the whole fallback
      const status = await haas.humans.getStatus(humanId);
      if (status.availability.status !== 'available') {
        continue;
      }
      return await haas.tasks.create({
        ...task,
        human_id: humanId
      });
    } catch (error) {
      console.log(`${humanId} failed (${error.message}), trying next...`);
    }
  }
  // All preferred humans unavailable - use auto-matching
  return await haas.tasks.create({
    ...task,
    human_id: null, // Let HaaS find someone
    requirements: task.skill_requirements
  });
}
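The `preferredHumans` list has to come from somewhere. One way to build it, sketched here under assumptions, is to rank candidates by how many of the required skills they cover. The `rankBySkillOverlap` helper and the `skills` field on each candidate are hypothetical and not documented HaaS fields:

```javascript
// Hypothetical helper: order candidates by the number of required skills they cover.
function rankBySkillOverlap(requiredSkills, candidates) {
  const required = new Set(requiredSkills);
  return candidates
    .map((candidate) => ({
      id: candidate.id,
      // Deduplicate first so a repeated skill is not counted twice
      overlap: [...new Set(candidate.skills)].filter((s) => required.has(s)).length
    }))
    .filter((entry) => entry.overlap > 0)   // drop humans with no relevant skill
    .sort((a, b) => b.overlap - a.overlap)  // best match first
    .map((entry) => entry.id);
}

const preferredHumans = rankBySkillOverlap(
  ['translation', 'legal'],
  [
    { id: 'usr_alex_17', skills: ['translation'] },
    { id: 'usr_maria_42', skills: ['translation', 'legal'] },
    { id: 'usr_sam_89', skills: ['design'] }
  ]
);
console.log(preferredHumans); // [ 'usr_maria_42', 'usr_alex_17' ]
```

The resulting array can be passed directly as the second argument of `dispatchWithFallback`.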
Circuit Breaker Pattern
When a human shows repeated failures, temporarily stop sending them tasks. This protects both your workflow and the human.
class HumanCircuitBreaker {
  constructor(humanId, options = {}) {
    this.humanId = humanId;
    // ?? keeps an explicit 0 instead of silently replacing it with the default
    this.failureThreshold = options.failureThreshold ?? 3;
    this.resetTimeout = options.resetTimeout ?? 30 * 60 * 1000; // 30 min
    this.failures = 0;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.lastFailure = null;
  }

  async execute(taskFn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        // Cool-down elapsed: let one trial task through
        this.state = 'HALF_OPEN';
      } else {
        throw new Error(`Circuit open for ${this.humanId}`);
      }
    }
    try {
      const result = await taskFn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    // In HALF_OPEN the count is still at the threshold, so one failure reopens the circuit
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      console.log(`Circuit opened for ${this.humanId} - taking a break`);
    }
  }
}
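With one breaker per human, a dispatcher can skip anyone whose circuit is currently open before it even attempts a task. The sketch below assumes breaker objects that expose the same `humanId`, `state`, `lastFailure`, and `resetTimeout` fields as the class above; `eligibleHumans` is a hypothetical helper name:

```javascript
// Hypothetical helper: list the humans who may receive a task right now.
// Works on any objects shaped like HumanCircuitBreaker instances.
function eligibleHumans(breakers, now = Date.now()) {
  return breakers
    .filter((breaker) => {
      if (breaker.state !== 'OPEN') return true; // CLOSED or HALF_OPEN
      // An OPEN circuit becomes eligible again once its cool-down has elapsed
      return now - breaker.lastFailure > breaker.resetTimeout;
    })
    .map((breaker) => breaker.humanId);
}

const THIRTY_MIN = 30 * 60 * 1000;
const now = 100 * 60 * 1000; // fixed clock value for the example
const humans = eligibleHumans(
  [
    { humanId: 'usr_maria_42', state: 'CLOSED', lastFailure: null, resetTimeout: THIRTY_MIN },
    { humanId: 'usr_alex_17', state: 'OPEN', lastFailure: now - 5 * 60 * 1000, resetTimeout: THIRTY_MIN },
    { humanId: 'usr_sam_89', state: 'OPEN', lastFailure: now - 45 * 60 * 1000, resetTimeout: THIRTY_MIN }
  ],
  now
);
console.log(humans); // [ 'usr_maria_42', 'usr_sam_89' ]
```

Passing `now` explicitly keeps the function deterministic, which makes the cool-down logic easy to test.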
Redundancy Patterns
For critical tasks, do not rely on a single human.
Primary/Backup
Assign a primary human but have a backup ready to take over if needed.
async function criticalTaskDispatch(task) {
  const assignment = await haas.tasks.create({
    ...task,
    human_id: 'usr_maria_42',
    backup_human_id: 'usr_alex_17',
    failover_trigger: {
      no_response_after: '15m',
      no_progress_after: '30m',
      explicit_decline: true
    }
  });
  // HaaS automatically fails over to backup if triggers fire
  return assignment;
}
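HaaS evaluates the failover triggers itself, but it can help to reason about when they would fire. The following is only an illustrative sketch of such an evaluation, not the platform's actual logic. The `state` snapshot (`assignedAt`, `respondedAt`, `lastProgressAt`, `declined`) and both function names are assumptions:

```javascript
// Parse durations such as '15m', '30s', or '2h' into milliseconds.
function parseDuration(text) {
  const match = /^(\d+)(s|m|h)$/.exec(text);
  if (!match) throw new Error(`Unsupported duration: ${text}`);
  const unitMs = { s: 1000, m: 60 * 1000, h: 60 * 60 * 1000 };
  return Number(match[1]) * unitMs[match[2]];
}

// Returns the name of the trigger that fired, or null if none did.
// All timestamps are milliseconds; `state` is a hypothetical task snapshot.
function firedTrigger(state, trigger, now) {
  if (trigger.explicit_decline && state.declined) return 'explicit_decline';
  if (state.respondedAt == null) {
    const waited = now - state.assignedAt;
    return waited >= parseDuration(trigger.no_response_after) ? 'no_response_after' : null;
  }
  // After a response, measure silence from the most recent sign of activity
  const lastActivity = state.lastProgressAt ?? state.respondedAt;
  return now - lastActivity >= parseDuration(trigger.no_progress_after)
    ? 'no_progress_after'
    : null;
}

const MIN = 60 * 1000;
const trigger = { no_response_after: '15m', no_progress_after: '30m', explicit_decline: true };
const silent = { assignedAt: 0, respondedAt: null, lastProgressAt: null, declined: false };
console.log(firedTrigger(silent, trigger, 16 * MIN)); // no_response_after
```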
Parallel Redundancy
For very critical tasks, dispatch to multiple humans and use the first good result.
async function ultraCriticalTask(task) {
  // Dispatch to 3 humans simultaneously
  const humanIds = ['usr_maria_42', 'usr_alex_17', 'usr_sam_89'];
  const settled = await Promise.allSettled(
    humanIds.map((human_id) => haas.tasks.create({ ...task, human_id }))
  );
  // Keep the dispatches that succeeded, so one failed dispatch
  // does not discard tasks that were already created
  const tasks = settled
    .filter((outcome) => outcome.status === 'fulfilled')
    .map((outcome) => outcome.value);
  if (tasks.length === 0) {
    throw new Error('All parallel dispatches failed');
  }
  // Wait for first completion
  const result = await haas.tasks.raceToCompletion(
    tasks.map(t => t.id),
    {
      cancel_others: true,
      min_quality_score: 0.8
    }
  );
  return result;
}
// Note: This costs 3x and should only be used for truly critical work
// Always compensate humans whose work was cancelled
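To make the "first good result" idea concrete, here is a client-side sketch of the same race using plain promises. It is not how `raceToCompletion` is implemented; `firstAcceptable` is a hypothetical helper, and each submission is assumed to resolve to an object with a `quality_score` field:

```javascript
// Resolves with the first result whose quality_score meets the threshold.
// Rejects (with an AggregateError) only if every submission fails or scores too low.
function firstAcceptable(submissions, minQualityScore) {
  return Promise.any(
    submissions.map(async (submission) => {
      const result = await submission;
      if (result.quality_score < minQualityScore) {
        // Turning a low score into a rejection lets Promise.any skip it
        throw new Error(`Quality ${result.quality_score} is below ${minQualityScore}`);
      }
      return result;
    })
  );
}

// Example: the fastest answer is too weak, so the second one wins
const later = (ms, value) => new Promise((resolve) => setTimeout(() => resolve(value), ms));
firstAcceptable(
  [
    later(10, { human_id: 'usr_maria_42', quality_score: 0.5 }),
    later(20, { human_id: 'usr_alex_17', quality_score: 0.9 }),
    later(30, { human_id: 'usr_sam_89', quality_score: 0.95 })
  ],
  0.8
).then((winner) => console.log(winner.human_id)); // usr_alex_17
```

`Promise.any` ignores rejections until every promise has rejected, which is what allows a fast but low-quality result to be passed over.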
Timeout Handling
Different timeout scenarios require different responses.
Acceptance Timeout
The human did not accept the task within the acceptance window.
haas.on('task.timeout', async (event) => {
  if (event.timeout_type === 'acceptance') {
    // Task returns to pool automatically
    // Optionally boost priority or expand matching criteria
    await haas.tasks.update(event.task_id, {
      priority: 'high',
      expand_matching: true
    });
  }
});
Progress Timeout
The human accepted the task but has not shown progress.
haas.on('task.stalled', async (event) => {
  // First: try to communicate
  await haas.messages.send(event.human_id, {
    type: 'check_in',
    content: 'Hi! Just checking if you need any help with this task?',
    related_task: event.task_id
  });
  // Set a follow-up check
  setTimeout(async () => {
    try {
      const task = await haas.tasks.get(event.task_id);
      if (task.status === 'in_progress' && task.progress === event.progress) {
        // Still stalled - consider reassignment
        await haas.tasks.reassign(event.task_id, {
          reason: 'no_progress',
          compensate_original: true
        });
      }
    } catch (error) {
      // Errors thrown inside a timer callback would otherwise go unhandled
      console.error(`Follow-up check failed for ${event.task_id}:`, error);
    }
  }, 15 * 60 * 1000); // Check again in 15 min
});
Deadline Timeout
The task was not completed before the deadline.
haas.on('task.expired', async (event) => {
  // Check if partial work is usable
  if (event.partial_result && event.partial_result.progress > 0.5) {
    // Significant progress - might be salvageable
    await haas.tasks.create({
      title: `Complete: ${event.original_title}`,
      type: 'continuation',
      context: event.partial_result,
      priority: 'high',
      exclude_humans: [event.human_id] // Don't reassign to same human
    });
  } else {
    // Start fresh with different human
    await haas.tasks.create({
      ...event.original_task,
      priority: 'urgent',
      exclude_humans: [event.human_id]
    });
  }
});
SLA Considerations
If you have SLA requirements, plan for human unreliability in your architecture.
SLA Buffer Calculation
// If you need 99% success rate for a workflow:
// Single human: ~94% success rate
//   Not sufficient for 99% SLA
// Primary + Backup: ~99.6% success rate
//   1 - (0.06 * 0.06) = 0.9964
//   Meets 99% SLA with margin
//   (assumes the two humans fail independently of each other)
// For 99.9% SLA, consider:
// 1. Parallel redundancy (3 humans): ~99.98%
//    1 - (0.06 * 0.06 * 0.06) = 0.999784
// 2. Longer deadlines (reduces timeout failures)
// 3. Pre-vetted, high-reliability human pool
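The arithmetic above generalizes to any pool size. As a small sketch, assuming every human succeeds independently with the same probability `p` (a simplification, since real failures such as holidays or outages can be correlated), the combined rate and the required head count can be computed as follows. Both function names are illustrative:

```javascript
// Probability that at least one of n humans succeeds,
// assuming each succeeds independently with probability p.
function combinedSuccessRate(p, n) {
  return 1 - (1 - p) ** n;
}

// Smallest number of independent humans needed to reach the target rate.
function humansNeeded(p, target) {
  if (!(p > 0 && p < 1) || !(target > 0 && target < 1)) {
    throw new RangeError('p and target must be strictly between 0 and 1');
  }
  // Solve (1 - p)^n <= 1 - target for n
  return Math.ceil(Math.log(1 - target) / Math.log(1 - p));
}

console.log(combinedSuccessRate(0.94, 2).toFixed(4)); // 0.9964
console.log(combinedSuccessRate(0.94, 3).toFixed(4)); // 0.9998
console.log(humansNeeded(0.94, 0.99));   // 2
console.log(humansNeeded(0.94, 0.999));  // 3
console.log(humansNeeded(0.94, 0.9999)); // 4
```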
Reliability Tiers
| Tier | Target | Strategy | Cost |
|---|---|---|---|
| Standard | ~94% | Single human, auto-retry | 1x |
| High | ~99% | Primary + backup | 1.1-1.3x |
| Critical | ~99.9% | Parallel redundancy | 2-3x |
| Mission Critical | ~99.99% | Parallel + human-AI hybrid | 5x+ |
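The tier table can be encoded as data so that a workflow picks the cheapest tier that still meets its requirement. This is a sketch only: `RELIABILITY_TIERS` and `tierFor` are hypothetical names, and the targets are the approximate figures from the table:

```javascript
// Tiers copied from the table, ordered from cheapest to most expensive.
const RELIABILITY_TIERS = [
  { name: 'Standard', target: 0.94, strategy: 'Single human, auto-retry' },
  { name: 'High', target: 0.99, strategy: 'Primary + backup' },
  { name: 'Critical', target: 0.999, strategy: 'Parallel redundancy' },
  { name: 'Mission Critical', target: 0.9999, strategy: 'Parallel + human-AI hybrid' }
];

// Returns the first (cheapest) tier whose target meets the required success rate.
function tierFor(requiredSuccessRate) {
  const tier = RELIABILITY_TIERS.find((t) => t.target >= requiredSuccessRate);
  if (!tier) {
    throw new RangeError(`No tier reaches ${requiredSuccessRate}`);
  }
  return tier;
}

console.log(tierFor(0.995).name); // Critical
```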
Monitoring and Alerting
Proactive monitoring catches problems before they become failures.
Key Metrics to Track
Acceptance Rate
How often tasks are accepted. Dropping rates indicate availability issues.
Time to Complete
Increasing completion times may signal fatigue or difficulty.
Quality Scores
Track quality over time. Declining scores need intervention.
Failure Rate
Monitor task failures by human and by task type.
// Alert when a human's reliability drops
haas.alerts.create({
  name: 'reliability_drop',
  condition: {
    metric: 'human.reliability_score',
    operator: 'drops_below',
    threshold: 0.80,
    window: '24h'
  },
  action: {
    type: 'webhook',
    url: 'https://your-app.com/alerts/reliability',
    include_context: true
  }
});

// Alert on unusual failure patterns
haas.alerts.create({
  name: 'failure_spike',
  condition: {
    metric: 'tasks.failure_rate',
    operator: 'exceeds',
    threshold: 0.15, // 15% failure rate
    window: '1h'
  },
  action: {
    type: 'webhook',
    url: 'https://your-app.com/alerts/failures'
  }
});
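The four metrics listed under Key Metrics to Track can also be derived locally from raw task records, for example to cross-check a dashboard. The sketch below assumes a hypothetical record shape (`accepted`, `failed`, `durationMinutes`, `qualityScore`); none of these field names come from the HaaS API:

```javascript
// Hypothetical record shape:
// { accepted: boolean, failed: boolean, durationMinutes: number|null, qualityScore: number|null }
function summarizeMetrics(records) {
  const mean = (values) =>
    values.length ? values.reduce((sum, v) => sum + v, 0) / values.length : null;
  const accepted = records.filter((r) => r.accepted);
  const completed = accepted.filter((r) => !r.failed);
  return {
    // Share of offered tasks that a human accepted
    acceptanceRate: records.length ? accepted.length / records.length : null,
    // Share of accepted tasks that ended in failure
    failureRate: accepted.length
      ? accepted.filter((r) => r.failed).length / accepted.length
      : null,
    meanMinutesToComplete: mean(
      completed.filter((r) => r.durationMinutes != null).map((r) => r.durationMinutes)
    ),
    meanQualityScore: mean(
      completed.filter((r) => r.qualityScore != null).map((r) => r.qualityScore)
    )
  };
}

const metrics = summarizeMetrics([
  { accepted: true, failed: false, durationMinutes: 30, qualityScore: 0.9 },
  { accepted: true, failed: false, durationMinutes: 50, qualityScore: 0.7 },
  { accepted: true, failed: true, durationMinutes: null, qualityScore: null },
  { accepted: false, failed: false, durationMinutes: null, qualityScore: null }
]);
console.log(metrics.acceptanceRate, metrics.meanMinutesToComplete); // 0.75 40
```

Running the same summary per human and per task type gives the breakdown that the failure-rate metric calls for.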
Graceful Degradation
When human capacity is limited, prioritize gracefully.
1. Queue Non-Critical Work
When humans are unavailable, queue low-priority tasks rather than failing immediately. They can catch up when capacity returns.
2. Reduce Quality Requirements
For some tasks, "good enough" is acceptable when "perfect" is not available. Define acceptable quality tiers upfront.
3. Extend Deadlines Automatically
When humans are overloaded, automatically extend non-critical deadlines. This reduces stress and improves outcomes.
4. Hybrid Fallback
For some task types, an AI-generated draft with human review may be acceptable when full human work is not available.
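The four strategies above can be combined into a single routing decision. The sketch below shows one possible ordering, not a prescribed policy. The `capacityRatio` input (available humans divided by humans needed) and the task flags `critical`, `allowHybrid`, `deadlineFlexible`, and `acceptsLowerQuality` are all assumptions made for illustration:

```javascript
// Hypothetical policy: decide how to handle a task when capacity is short.
function degradationAction(task, capacityRatio) {
  if (capacityRatio >= 1) return 'dispatch'; // enough humans, no degradation needed
  if (task.critical) {
    // Critical work is never queued; use an AI draft with human review if permitted
    return task.allowHybrid ? 'hybrid_fallback' : 'dispatch';
  }
  if (task.deadlineFlexible) return 'extend_deadline';
  if (task.acceptsLowerQuality) return 'reduce_quality';
  return 'queue'; // default for non-critical work: wait for capacity to return
}

console.log(degradationAction({ critical: false, deadlineFlexible: true }, 0.6)); // extend_deadline
console.log(degradationAction({ critical: true, allowHybrid: true }, 0.6));       // hybrid_fallback
console.log(degradationAction({ critical: false }, 0.6));                         // queue
```

Keeping the policy in one pure function makes the trade-offs explicit and easy to review with the people affected by them.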
Post-Failure Analysis
Every failure is a learning opportunity.
GET /v1/analytics/failures?period=7d
{
  "summary": {
    "total_tasks": 1247,
    "failures": 73,
    "failure_rate": 0.058
  },
  "by_cause": {
    "timeout": 31,
    "quality_rejected": 18,
    "human_cancelled": 12,
    "human_unavailable": 8,
    "other": 4
  },
  "patterns": [
    {
      "insight": "68% of timeouts occur on Friday afternoons",
      "recommendation": "Avoid Friday PM deadlines for complex tasks"
    },
    {
      "insight": "Quality rejections clustered around 3 specific humans",
      "recommendation": "Review task-human matching for these individuals"
    }
  ]
}
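A response like this is easier to act on once the causes are ranked by their share of all failures. A minimal sketch, assuming the `by_cause` object has the shape shown above; `rankCauses` is a hypothetical helper:

```javascript
// Turn a by_cause object into a list sorted by count, with each cause's share of failures.
function rankCauses(byCause) {
  const total = Object.values(byCause).reduce((sum, count) => sum + count, 0);
  return Object.entries(byCause)
    .map(([cause, count]) => ({ cause, count, share: total ? count / total : 0 }))
    .sort((a, b) => b.count - a.count);
}

const ranked = rankCauses({
  timeout: 31,
  quality_rejected: 18,
  human_cancelled: 12,
  human_unavailable: 8,
  other: 4
});
console.log(ranked[0].cause, (ranked[0].share * 100).toFixed(1) + '%'); // timeout 42.5%
```

In this sample, timeouts account for roughly two in five failures, which suggests deadline and timeout settings are the first place to look.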
Reliability requires respect
The most reliable human relationships are built on ethical treatment. Learn about our framework for sustainable human-AI collaboration in Ethical Considerations.