Best Fault Tolerance Patterns for Web Apps

Bright SEO Tools in saas | Published: Apr 04, 2026 | Updated: Apr 04, 2026

Fault tolerance is the ability of a system to continue operating when components fail. Web applications face constant failures — network timeouts, unresponsive APIs, database connection errors, memory exhaustion. The difference between systems that crash under these conditions and those that degrade gracefully is architectural: fault-tolerant systems anticipate failures and contain their impact through specific design patterns.

This article covers the essential fault tolerance patterns for production web applications: circuit breakers to prevent cascading failures, retry logic with exponential backoff for transient errors, bulkheads to isolate failures, timeouts to prevent resource exhaustion, and fallback strategies to maintain functionality when dependencies fail. You'll learn not just how to implement these patterns, but when each pattern solves specific failure modes and how they work together to create resilient systems.

The focus is on practical implementation in Node.js, with real-world examples that handle the failures you actually encounter in production: slow third-party APIs, intermittent database connections, and services that become temporarily unavailable. The examples are written in plain JavaScript and carry over to TypeScript once types are added.

Why Applications Fail Without Fault Tolerance

Most production failures aren't caused by bugs in your code — they're caused by dependencies behaving unexpectedly. A payment API that normally responds in 100ms suddenly takes 30 seconds. A database that handles 1000 queries per second gets overwhelmed at 1100. A third-party service that's been reliable for months goes down during your peak traffic period.

Without fault tolerance patterns, these dependency failures cascade through your system. Slow responses consume connection pool slots until no connections are available. Retrying failed requests without backoff creates retry storms that amplify load on already-struggling services. Requests that should fail fast instead hang for minutes, consuming memory and preventing new requests from being processed.
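A rough estimate shows why slow responses drain a pool so quickly. By Little's Law, the average number of in-flight requests equals arrival rate multiplied by latency. The sketch below applies that to the payment example above; the traffic figure of 100 requests per second and the helper name are assumptions made for illustration.

```javascript
// Little's Law: in-flight requests = arrival rate (req/s) * latency (s).
// Hypothetical helper; the traffic numbers are assumed, not measured.
function inFlightRequests(requestsPerSecond, latencyMs) {
    return (requestsPerSecond * latencyMs) / 1000;
}

const normal = inFlightRequests(100, 100);     // API answering in 100 ms
const degraded = inFlightRequests(100, 30000); // same API answering in 30 s

console.log(`normal: ${normal} in flight, degraded: ${degraded} in flight`);
```

Under these assumed numbers the same traffic needs 10 connections when the API is healthy and 3000 when it is slow, which is far beyond a typical pool size.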

The common pattern: a small failure in one component triggers timeouts in dependent components, which trigger retries, which create additional load, which causes more failures, until the entire system becomes unresponsive. Fault tolerance patterns break this cascade by containing failures, failing fast, and providing degraded functionality instead of complete failure.

Warning: Fault tolerance patterns add complexity. Don't implement them preemptively for every service call. Start by identifying your actual failure points through monitoring, then apply patterns to protect against the failures you're experiencing. Premature fault tolerance is over-engineering.

Circuit Breaker Pattern: Preventing Cascading Failures

The circuit breaker pattern prevents your application from repeatedly calling a failing service. Like an electrical circuit breaker that opens when it detects too much current, a software circuit breaker opens when it detects too many failures, immediately returning errors instead of attempting calls that will likely fail.

The pattern operates in three states: closed (normal operation, requests pass through), open (too many failures detected, requests fail immediately), and half-open (testing if the service has recovered). This prevents wasting resources on calls that will fail and gives the failing service time to recover without being hammered by retry attempts.

class CircuitBreaker {
    constructor(options = {}) {
        this.failureThreshold = options.failureThreshold || 5;
        this.successThreshold = options.successThreshold || 2;
        this.timeout = options.timeout || 60000; // 60 seconds
        this.state = 'CLOSED';
        this.failureCount = 0;
        this.successCount = 0;
        this.nextAttempt = Date.now();
    }

    async execute(operation) {
        if (this.state === 'OPEN') {
            if (Date.now() < this.nextAttempt) {
                throw new Error('Circuit breaker is OPEN');
            }
            // Try to recover
            this.state = 'HALF_OPEN';
        }

        try {
            const result = await operation();
            this.onSuccess();
            return result;
        } catch (error) {
            this.onFailure();
            throw error;
        }
    }

    onSuccess() {
        this.failureCount = 0;

        if (this.state === 'HALF_OPEN') {
            this.successCount++;
            if (this.successCount >= this.successThreshold) {
                this.state = 'CLOSED';
                this.successCount = 0;
                console.log('Circuit breaker closed - service recovered');
            }
        }
    }

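    onFailure() {
        this.failureCount++;
        this.successCount = 0;

        // A failure while HALF_OPEN means the service has not recovered,
        // so reopen immediately instead of waiting for the threshold again
        if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
            this.state = 'OPEN';
            this.nextAttempt = Date.now() + this.timeout;
            console.log(`Circuit breaker opened - will retry after ${this.timeout}ms`);
        }
    }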

    getState() {
        return this.state;
    }
}

// Usage example
const paymentAPIBreaker = new CircuitBreaker({
    failureThreshold: 5,
    successThreshold: 2,
    timeout: 30000
});

async function processPayment(orderId, amount) {
    try {
        return await paymentAPIBreaker.execute(async () => {
            const response = await fetch('https://api.payment.com/charge', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ orderId, amount }),
                signal: AbortSignal.timeout(5000)
            });

            if (!response.ok) {
                throw new Error(`Payment API error: ${response.status}`);
            }

            return response.json();
        });
    } catch (error) {
        if (error.message === 'Circuit breaker is OPEN') {
            // Handle circuit breaker open state
            console.log('Payment service is unavailable, using fallback');
            return { status: 'pending', message: 'Payment queued for processing' };
        }
        throw error;
    }
}

The circuit breaker tracks failures and automatically stops calling the payment API after 5 consecutive failures. It stays open for 30 seconds, then enters half-open state to test if the service has recovered. If 2 consecutive requests succeed in half-open state, the circuit closes and normal operation resumes.
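The same behaviour can be written as a pure transition function, which makes the three states easy to reason about and to test without timers. This is a sketch rather than a replacement for the class above: the names `step` and `initialBreaker`, the `tick` event, and the explicit `now` timestamp are all assumptions introduced for illustration.

```javascript
// Pure transition function for the three breaker states (illustrative sketch).
const CONFIG = { failureThreshold: 5, successThreshold: 2, openMs: 30000 };

function initialBreaker() {
    return { state: 'CLOSED', failures: 0, successes: 0, openedAt: null };
}

// event: { type: 'success' | 'failure' | 'tick', now: <timestamp in ms> }
function step(breaker, event, cfg = CONFIG) {
    switch (breaker.state) {
        case 'CLOSED': {
            if (event.type === 'success') return { ...breaker, failures: 0 };
            if (event.type !== 'failure') return breaker;
            const failures = breaker.failures + 1;
            return failures >= cfg.failureThreshold
                ? { state: 'OPEN', failures, successes: 0, openedAt: event.now }
                : { ...breaker, failures };
        }
        case 'OPEN':
            // Calls are rejected until the cool-down has elapsed
            if (event.now - breaker.openedAt < cfg.openMs) return breaker;
            // Cool-down over: treat this event as the first trial call
            return step({ ...breaker, state: 'HALF_OPEN', successes: 0 }, event, cfg);
        case 'HALF_OPEN': {
            if (event.type === 'failure') {
                // Any failed trial call reopens the circuit
                return { ...breaker, state: 'OPEN', successes: 0, openedAt: event.now };
            }
            if (event.type !== 'success') return breaker;
            const successes = breaker.successes + 1;
            return successes >= cfg.successThreshold
                ? initialBreaker()
                : { ...breaker, successes };
        }
        default:
            throw new Error(`Unknown state: ${breaker.state}`);
    }
}

// Five failures at t=0 open the circuit; two successes after 30 s close it again
let breaker = initialBreaker();
for (let i = 0; i < 5; i++) breaker = step(breaker, { type: 'failure', now: 0 });
console.log(breaker.state);
breaker = step(breaker, { type: 'success', now: 30000 });
breaker = step(breaker, { type: 'success', now: 30001 });
console.log(breaker.state);
```

Keeping time as an input rather than calling `Date.now()` inside the function is a deliberate choice here: it lets the cool-down be checked in a unit test without waiting 30 seconds.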

Critical consideration: circuit breakers should be implemented at the service boundary (where you call external services), not around individual database queries. Opening a circuit breaker around database queries means all database access fails, which is rarely what you want. Circuit breakers protect against dependency failures, not internal failures.

Pro Tip: Monitor circuit breaker state changes. When a circuit opens, that's a critical event indicating a dependency failure. Alert your team immediately so they can investigate whether it's a temporary issue or requires intervention. Don't wait for user complaints to discover that your payment processing has been down for 10 minutes.

Retry Pattern with Exponential Backoff

Retries allow transient failures to succeed on subsequent attempts. Network blips, temporary service overload, and brief database connection issues often resolve if you simply try again. The key is implementing retries intelligently — immediate retries amplify load on struggling services, while exponential backoff gives services time to recover.

Exponential backoff means each retry waits longer than the previous one: first retry after 1 second, second retry after 2 seconds, third retry after 4 seconds. This prevents retry storms where thousands of clients simultaneously retry requests, overwhelming the already-struggling service.
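The schedule described above follows delay = baseDelay * 2^(n - 1) for retry number n, capped at a maximum. The short sketch below computes it without jitter; the function name `backoffSchedule` is an assumption for illustration.

```javascript
// Delay before retry n (1-based): baseDelay * 2^(n - 1), capped at maxDelay.
function backoffSchedule(baseDelay, maxDelay, retries) {
    const delays = [];
    for (let n = 1; n <= retries; n++) {
        delays.push(Math.min(baseDelay * 2 ** (n - 1), maxDelay));
    }
    return delays;
}

console.log(backoffSchedule(1000, 30000, 3)); // the 1 s, 2 s, 4 s schedule from the text
console.log(backoffSchedule(1000, 10000, 5)); // later delays hit the 10 s cap
```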

class RetryPolicy {
    constructor(options = {}) {
        this.maxAttempts = options.maxAttempts || 3;
        this.baseDelay = options.baseDelay || 1000; // 1 second
        this.maxDelay = options.maxDelay || 30000; // 30 seconds
        this.retryableErrors = options.retryableErrors || [
            'ECONNRESET',
            'ETIMEDOUT',
            'ECONNREFUSED',
            'EHOSTUNREACH'
        ];
        this.retryableStatusCodes = options.retryableStatusCodes || [
            408, // Request Timeout
            429, // Too Many Requests
            500, // Internal Server Error
            502, // Bad Gateway
            503, // Service Unavailable
            504  // Gateway Timeout
        ];
    }

    async execute(operation, context = {}) {
        let lastError;

        for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
            try {
                const result = await operation();

                // Log successful retry
                if (attempt > 1) {
                    console.log(`Operation succeeded on attempt ${attempt}`);
                }

                return result;
            } catch (error) {
                lastError = error;

                // Check if error is retryable
                if (!this.isRetryable(error)) {
                    throw error;
                }

                // Don't delay after the last attempt
                if (attempt === this.maxAttempts) {
                    break;
                }

                // Calculate delay with exponential backoff and jitter
                const delay = this.calculateDelay(attempt);

                console.log(
                    `Attempt ${attempt} failed: ${error.message}. ` +
                    `Retrying in ${delay}ms...`
                );

                await this.sleep(delay);
            }
        }

        // All retries exhausted; keep the original error as `cause` for debugging
        throw new Error(
            `Operation failed after ${this.maxAttempts} attempts: ${lastError.message}`,
            { cause: lastError }
        );
    }

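    isRetryable(error) {
        // Check error code. Node's fetch wraps network failures in a TypeError
        // and exposes the system code on error.cause, so look at both.
        const code = error.code ?? error.cause?.code;
        if (this.retryableErrors.includes(code)) {
            return true;
        }

        // AbortSignal.timeout() rejects with an error named 'TimeoutError'
        if (error.name === 'TimeoutError') {
            return true;
        }

        // Check HTTP status code
        if (error.response &&
            this.retryableStatusCodes.includes(error.response.status)) {
            return true;
        }

        return false;
    }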

    calculateDelay(attempt) {
        // Exponential backoff: 2^(attempt - 1) * baseDelay, so 1x, 2x, 4x, ...
        const exponentialDelay = Math.pow(2, attempt - 1) * this.baseDelay;

        // Add jitter (random 0-25% variation) to prevent thundering herd
        const jitter = exponentialDelay * 0.25 * Math.random();

        // Apply max delay cap and round to whole milliseconds
        return Math.round(Math.min(exponentialDelay + jitter, this.maxDelay));
    }

    sleep(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}

// Usage
const retryPolicy = new RetryPolicy({
    maxAttempts: 4,
    baseDelay: 1000,
    maxDelay: 10000
});

async function fetchUserData(userId) {
    return retryPolicy.execute(async () => {
        const response = await fetch(`https://api.users.com/users/${userId}`, {
            signal: AbortSignal.timeout(5000)
        });

        if (!response.ok) {
            const error = new Error(`HTTP ${response.status}`);
            error.response = { status: response.status };
            throw error;
        }

        return response.json();
    });
}

// Retry only idempotent operations
async function getUserProfile(userId) {
    try {
        return await fetchUserData(userId);
    } catch (error) {
        console.error('Failed to fetch user data after retries', error);
        // Return cached data or default profile
        return { id: userId, name: 'Unknown', cached: true };
    }
}

The jitter component (random 0-25% variation in delay) is critical. Without jitter, all clients that failed at the same time will retry at exactly the same time, creating synchronized retry waves that look like DDoS attacks to the recovering service. Jitter spreads retries across time.
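A small experiment makes the effect visible. The sketch below counts how many distinct retry instants a group of clients produces with and without the 0-25% jitter; the function names and the injectable `random` parameter are assumptions added so the behaviour can be checked deterministically.

```javascript
// Delay for a given attempt with 0-25% additive jitter.
// `random` is injectable so the result can be tested deterministically.
function jitteredDelay(attempt, baseDelay, random = Math.random) {
    const exponential = baseDelay * 2 ** (attempt - 1);
    return Math.round(exponential + exponential * 0.25 * random());
}

// Count how many distinct retry instants a group of clients produces.
function distinctRetryTimes(clients, attempt, baseDelay, random) {
    const times = new Set();
    for (let i = 0; i < clients; i++) {
        times.add(jitteredDelay(attempt, baseDelay, random));
    }
    return times.size;
}

// 1000 clients that all failed at the same moment, first retry
const withoutJitter = distinctRetryTimes(1000, 1, 1000, () => 0);
const withJitter = distinctRetryTimes(1000, 1, 1000, Math.random);
console.log({ withoutJitter, withJitter });
```

Without jitter every client lands on the same millisecond; with jitter the retries spread across a 250 ms window for the first attempt, and that window doubles with each later attempt.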

Important limitation: only retry idempotent operations (operations that can be safely executed multiple times). GET requests are idempotent. POST requests that create resources are not — retrying might create duplicate resources. For non-idempotent operations, use idempotency keys to make them safely retryable.
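One way to picture an idempotency key is as a lookup that happens before the real work: the first request with a given key runs the handler, and any retry with the same key gets the stored result. The sketch below is an in-memory illustration only; `makeIdempotent`, the key format, and the order handler are hypothetical, and a real service would persist keys and expire them.

```javascript
// Wraps a handler so repeated calls with the same idempotency key run it once.
// In-memory sketch: a real service would persist keys and expire them.
function makeIdempotent(handler) {
    const resultsByKey = new Map(); // idempotency key -> Promise of the first result

    return function handle(idempotencyKey, payload) {
        if (!resultsByKey.has(idempotencyKey)) {
            const promise = Promise.resolve().then(() => handler(payload));
            // Forget failed attempts so a later retry can run the handler again
            promise.catch(() => resultsByKey.delete(idempotencyKey));
            resultsByKey.set(idempotencyKey, promise);
        }
        return resultsByKey.get(idempotencyKey);
    };
}

// Hypothetical order-creation handler that counts how often it really runs
let ordersCreated = 0;
const createOrder = makeIdempotent(async (order) => {
    ordersCreated++;
    return { orderId: ordersCreated, ...order };
});

async function demo() {
    const first = await createOrder('key-123', { item: 'book' });
    const retry = await createOrder('key-123', { item: 'book' }); // retried POST
    console.log('same order returned:', first.orderId === retry.orderId);
    console.log('orders actually created:', ordersCreated);
}

demo();
```

Storing the promise rather than the resolved value means two concurrent retries with the same key also share a single execution.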

Bulkhead Pattern: Failure Isolation

The bulkhead pattern isolates resources so that failure in one part of the system doesn't exhaust resources needed by other parts. Named after ship bulkheads that compartmentalize a ship's hull, this pattern prevents a single slow or failing dependency from consuming all available connections, threads, or memory.

The implementation uses resource pools — separate connection pools for different services, separate thread pools for different types of work, or separate memory budgets for different operations. When one pool is exhausted, other pools continue functioning normally.

class BulkheadPool {
    constructor(name, maxConcurrent, queueSize = 0) {
        this.name = name;
        this.maxConcurrent = maxConcurrent;
        this.queueSize = queueSize;
        this.activeCount = 0;
        this.queue = [];
    }

    async execute(operation) {
        // Check if we can execute immediately
        if (this.activeCount < this.maxConcurrent) {
            return this.runOperation(operation);
        }

        // Check if queue is full
        if (this.queue.length >= this.queueSize) {
            throw new Error(`Bulkhead ${this.name} is full (${this.activeCount} active, ${this.queue.length} queued)`);
        }

        // Queue the operation
        return new Promise((resolve, reject) => {
            this.queue.push({ operation, resolve, reject });
        });
    }

    async runOperation(operation) {
        this.activeCount++;

        try {
            const result = await operation();
            return result;
        } finally {
            this.activeCount--;
            this.processQueue();
        }
    }

    processQueue() {
        if (this.queue.length === 0 || this.activeCount >= this.maxConcurrent) {
            return;
        }

        const { operation, resolve, reject } = this.queue.shift();

        this.runOperation(operation)
            .then(resolve)
            .catch(reject);
    }

    getMetrics() {
        return {
            name: this.name,
            active: this.activeCount,
            queued: this.queue.length,
            capacity: this.maxConcurrent,
            utilization: (this.activeCount / this.maxConcurrent * 100).toFixed(1) + '%'
        };
    }
}

// Create separate bulkheads for different dependencies
const paymentBulkhead = new BulkheadPool('payment-api', 10, 20);
const searchBulkhead = new BulkheadPool('search-api', 50, 100);
const databaseBulkhead = new BulkheadPool('database', 20, 50);

// Payment API calls use payment bulkhead
async function processPayment(orderId, amount) {
    return paymentBulkhead.execute(async () => {
        const response = await fetch('https://api.payment.com/charge', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ orderId, amount })
        });
        return response.json();
    });
}

// Search API calls use search bulkhead
async function searchProducts(query) {
    return searchBulkhead.execute(async () => {
        // Encode the query so spaces and characters like & or # cannot break the URL
        const response = await fetch(`https://api.search.com/search?q=${encodeURIComponent(query)}`);
        return response.json();
    });
}

// Monitor bulkhead utilization
setInterval(() => {
    console.log('Bulkhead metrics:', {
        payment: paymentBulkhead.getMetrics(),
        search: searchBulkhead.getMetrics(),
        database: databaseBulkhead.getMetrics()
    });
}, 10000);

With this pattern, if the payment API becomes slow and consumes all 10 slots in its bulkhead, search and database operations continue unaffected using their separate bulkheads. Without bulkheads, slow payment API calls could exhaust the entire connection pool, making search and database operations fail even though those services are healthy.

The queue size parameter handles temporary load spikes. When all 10 payment slots are busy, up to 20 additional requests can queue. This prevents rejecting requests during brief bursts while still maintaining a limit to prevent unbounded queuing.
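The admission rule behind those numbers is small enough to state on its own: run if a slot is free, queue if the queue has room, otherwise reject. The sketch below applies it to a burst of simultaneous arrivals, under the simplifying assumption that no request finishes during the burst; `admit` and `simulateBurst` are hypothetical helper names.

```javascript
// Decide what happens to a new request given current bulkhead occupancy.
function admit(active, queued, maxConcurrent, queueSize) {
    if (active < maxConcurrent) return 'run';
    if (queued < queueSize) return 'queue';
    return 'reject';
}

// Simulate a burst of arrivals, assuming nothing completes during the burst.
function simulateBurst(arrivals, maxConcurrent, queueSize) {
    const counts = { run: 0, queue: 0, reject: 0 };
    for (let i = 0; i < arrivals; i++) {
        const decision = admit(counts.run, counts.queue, maxConcurrent, queueSize);
        counts[decision]++;
    }
    return counts;
}

// Payment bulkhead from the example: 10 slots, queue of 20, burst of 35 requests
console.log(simulateBurst(35, 10, 20));
```

With the payment bulkhead's settings, a burst of 35 requests results in 10 running, 20 queued, and 5 rejected immediately.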

Pattern | Prevents | Use When
Circuit Breaker | Repeated calls to failing services | Service failures are persistent (minutes to hours)
Retry with Backoff | Retry storms amplifying load | Failures are transient (seconds)
Bulkhead | One slow service exhausting all resources | You have multiple dependencies with different SLAs
Timeout | Requests hanging indefinitely | Always - every network call needs a timeout
Fallback | Complete feature failure | You have alternative data sources or degraded modes

Timeout Pattern: Preventing Resource Exhaustion

Timeouts are the most fundamental fault tolerance pattern — every network operation needs a maximum time limit. Without timeouts, slow or unresponsive dependencies can hold resources (connections, memory, threads) indefinitely, eventually exhausting available capacity.

The challenge is setting appropriate timeout values. Too short and you reject requests that would have succeeded. Too long and you waste resources on operations that should fail fast. The right timeout depends on the operation's normal latency characteristics and your system's resource constraints.

class TimeoutManager {
    constructor(defaultTimeout = 5000) {
        this.defaultTimeout = defaultTimeout;
        this.timeouts = new Map();
    }

    setOperationTimeout(operationName, timeout) {
        this.timeouts.set(operationName, timeout);
    }

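    async executeWithTimeout(operation, operationName, customTimeout = null) {
        const timeout = customTimeout ??
                       this.timeouts.get(operationName) ??
                       this.defaultTimeout;

        let timer;
        const timeoutPromise = new Promise((_, reject) => {
            timer = setTimeout(() => {
                reject(new Error(
                    `Operation ${operationName} timed out after ${timeout}ms`
                ));
            }, timeout);
        });

        try {
            // Note: losing the race does not cancel the underlying operation;
            // pass an AbortSignal to fetch if the request itself must stop
            return await Promise.race([operation(), timeoutPromise]);
        } finally {
            // Clear the timer so it does not linger after the race settles
            clearTimeout(timer);
        }
    }
}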
}

const timeoutManager = new TimeoutManager(5000);

// Set operation-specific timeouts
timeoutManager.setOperationTimeout('payment-api', 10000);  // Payment can take longer
timeoutManager.setOperationTimeout('user-lookup', 2000);   // User lookup should be fast
timeoutManager.setOperationTimeout('analytics', 15000);    // Analytics can be slow

// Usage
async function getUserWithOrders(userId) {
    try {
        // Fast operation - 2 second timeout
        const user = await timeoutManager.executeWithTimeout(
            () => fetch(`https://api.users.com/users/${userId}`).then(r => r.json()),
            'user-lookup'
        );

        // Slower operation - 10 second timeout
        const orders = await timeoutManager.executeWithTimeout(
            () => fetch(`https://api.orders.com/users/${userId}/orders`).then(r => r.json()),
            'order-lookup',
            10000
        );

        return { user, orders };
    } catch (error) {
        if (error.message.includes('timed out')) {
            console.error('Request timed out', error);
            // Handle timeout specifically
            throw new Error('Service is responding slowly, please try again');
        }
        throw error;
    }
}

// Cancelling the request itself with AbortController
// (AbortSignal.timeout(ms), used in earlier examples, is the shorter built-in form)
async function modernTimeoutExample(userId) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 5000);

    try {
        const response = await fetch(`https://api.users.com/users/${userId}`, {
            signal: controller.signal
        });
        clearTimeout(timeout);
        return response.json();
    } catch (error) {
        clearTimeout(timeout);
        if (error.name === 'AbortError') {
            throw new Error('Request timed out');
        }
        throw error;
    }
}

Determining appropriate timeout values requires monitoring normal operation. If your payment API normally responds in 2 seconds with a 99th percentile of 5 seconds, setting a 10-second timeout allows slow requests to complete while preventing truly stuck requests from holding resources indefinitely. Setting a 3-second timeout would reject valid slow requests.

Key Insight: Timeouts should be set based on p99 latency (99th percentile), not average latency. If average response time is 200ms but p99 is 3 seconds, a 1-second timeout will reject more than 1% of valid requests, because at least 1% of them take 3 seconds or longer. As a starting point, use 3-5 seconds for most API calls and 10-15 seconds for known-slow operations like payment processing or complex queries.
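Deriving a timeout from measured latencies can be sketched in a few lines. The example below uses the nearest-rank definition of a percentile; the safety factor of 2 and the rounding to 100 ms are assumptions chosen for illustration, not a recommendation from any particular library.

```javascript
// Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
    if (samples.length === 0) throw new Error('no samples');
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical rule of thumb: p99 times a safety factor, rounded up to 100 ms.
function suggestTimeout(samplesMs, safetyFactor = 2) {
    const p99 = percentile(samplesMs, 99);
    return Math.ceil((p99 * safetyFactor) / 100) * 100;
}

// 98 fast responses at 200 ms and 2 slow ones at 3 s
const latencies = [...Array(98).fill(200), 3000, 3000];
const average = latencies.reduce((sum, ms) => sum + ms, 0) / latencies.length;

console.log({ average, p99: percentile(latencies, 99), timeout: suggestTimeout(latencies) });
```

In this made-up sample the average is 256 ms while p99 is 3000 ms, so a timeout derived from the average would cut off the slow but valid responses.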

Fallback Pattern: Graceful Degradation

Fallbacks provide alternative behavior when primary operations fail. This enables graceful degradation — the system continues providing value, just at a reduced level, rather than failing completely. Fallbacks can use cached data, default values, or simplified implementations.

class FallbackStrategy {
    async executeWithFallback(primary, fallbacks = []) {
        const strategies = [primary, ...fallbacks];
        let lastError;

        for (let i = 0; i < strategies.length; i++) {
            try {
                const result = await strategies[i]();

                // Log when using fallback
                if (i > 0) {
                    console.log(`Primary failed, using fallback level ${i}`);
                }

                return {
                    data: result,
                    source: i === 0 ? 'primary' : `fallback-${i}`,
                    degraded: i > 0
                };
            } catch (error) {
                lastError = error;
                console.warn(`Strategy ${i} failed:`, error.message);
                // Continue to next fallback
            }
        }

        throw new Error(`All strategies failed. Last error: ${lastError.message}`);
    }
}

const fallbackStrategy = new FallbackStrategy();
const cache = new Map(); // Simple cache

// Example: Product recommendations with multiple fallback levels
async function getRecommendations(userId) {
    return fallbackStrategy.executeWithFallback(
        // Primary: ML-based personalized recommendations
        async () => {
            const response = await fetch(
                `https://api.ml.com/recommendations/${userId}`,
                { signal: AbortSignal.timeout(3000) }
            );

            if (!response.ok) throw new Error('ML API failed');

            const data = await response.json();
            cache.set(`recommendations:${userId}`, data); // Cache for fallback
            return data;
        },
        [
            // Fallback 1: Cached recommendations
            async () => {
                const cached = cache.get(`recommendations:${userId}`);
                if (!cached) throw new Error('No cached data');
                return cached;
            },

            // Fallback 2: Popular items (not personalized, so fewer things can fail)
            async () => {
                const response = await fetch('https://api.products.com/popular');
                return response.json();
            },

            // Fallback 3: Static defaults
            async () => {
                return {
                    items: [],
                    message: 'Recommendations temporarily unavailable'
                };
            }
        ]
    );
}

// Example: User profile with fallback to basic info
async function getUserProfile(userId) {
    return fallbackStrategy.executeWithFallback(
        // Primary: Full profile with all enrichments
        async () => {
            const [user, preferences, activity] = await Promise.all([
                fetch(`https://api.users.com/users/${userId}`).then(r => r.json()),
                fetch(`https://api.preferences.com/users/${userId}`).then(r => r.json()),
                fetch(`https://api.activity.com/users/${userId}`).then(r => r.json())
            ]);
            return { user, preferences, activity, complete: true };
        },
        [
            // Fallback: Basic user info only
            async () => {
                const user = await fetch(`https://api.users.com/users/${userId}`)
                    .then(r => r.json());
                return {
                    user,
                    preferences: null,
                    activity: null,
                    complete: false
                };
            }
        ]
    );
}

// Usage with degradation indicator
// (assumes an Express `app` and a `metrics` collector like the one in the monitoring section)
app.get('/api/recommendations', async (req, res) => {
    try {
        const result = await getRecommendations(req.user.id);

        res.json({
            recommendations: result.data,
            source: result.source,
            degraded: result.degraded
        });

        // Track degraded responses
        if (result.degraded) {
            metrics.increment('recommendations.degraded', {
                source: result.source
            });
        }
    } catch (error) {
        res.status(503).json({
            error: 'Recommendations unavailable'
        });
    }
});

The fallback chain implements progressive degradation. The user experience goes from optimal (personalized ML recommendations) to acceptable (cached recommendations) to minimal (popular items) rather than broken (error page). Each level provides less value but is more reliable.

Combining Patterns: Resilience in Practice

Fault tolerance patterns work best when combined. A production-ready implementation typically uses circuit breakers around retries around timeouts, with bulkheads isolating different services and fallbacks handling complete failures.

class ResilientClient {
    constructor(serviceName, options = {}) {
        this.serviceName = serviceName;

        // Circuit breaker
        this.circuitBreaker = new CircuitBreaker({
            failureThreshold: options.failureThreshold || 5,
            successThreshold: options.successThreshold || 2,
            timeout: options.circuitTimeout || 60000
        });

        // Retry policy
        this.retryPolicy = new RetryPolicy({
            maxAttempts: options.maxRetries || 3,
            baseDelay: options.retryDelay || 1000
        });

        // Bulkhead
        this.bulkhead = new BulkheadPool(
            serviceName,
            options.maxConcurrent || 10,
            options.queueSize || 20
        );

        // Timeout
        this.defaultTimeout = options.timeout || 5000;
    }

    async call(operation, options = {}) {
        const timeout = options.timeout || this.defaultTimeout;

        // Layer 1: Bulkhead (resource isolation)
        return this.bulkhead.execute(async () => {
            // Layer 2: Circuit breaker (fail fast if service is down)
            return this.circuitBreaker.execute(async () => {
                // Layer 3: Retry (handle transient failures)
                return this.retryPolicy.execute(async () => {
                    // Layer 4: Timeout (prevent hanging)
                    return this.executeWithTimeout(operation, timeout);
                });
            });
        });
    }

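    async executeWithTimeout(operation, timeout) {
        let timer;
        const timeoutPromise = new Promise((_, reject) => {
            timer = setTimeout(() => {
                const error = new Error(`${this.serviceName} timed out after ${timeout}ms`);
                // Without a retryable code, RetryPolicy.isRetryable would never retry a timeout
                error.code = 'ETIMEDOUT';
                reject(error);
            }, timeout);
        });

        try {
            return await Promise.race([operation(), timeoutPromise]);
        } finally {
            clearTimeout(timer); // do not leave the timer running after the race settles
        }
    }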
}

// Usage
const paymentClient = new ResilientClient('payment-api', {
    maxConcurrent: 10,
    maxRetries: 3,
    timeout: 10000,
    failureThreshold: 5
});

const searchClient = new ResilientClient('search-api', {
    maxConcurrent: 50,
    maxRetries: 2,
    timeout: 3000,
    failureThreshold: 10
});

// Make resilient API call
async function processPayment(orderId, amount) {
    try {
        return await paymentClient.call(async () => {
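            const response = await fetch('https://api.payment.com/charge', {
                method: 'POST',
                headers: {
                    'Content-Type': 'application/json',
                    // This POST is retried, so it must be idempotent. The header name is an
                    // assumption; use the idempotency mechanism your provider documents.
                    'Idempotency-Key': String(orderId)
                },
                body: JSON.stringify({ orderId, amount })
            });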

            if (!response.ok) {
                const error = new Error(`Payment failed: ${response.status}`);
                error.response = { status: response.status };
                throw error;
            }

            return response.json();
        });
    } catch (error) {
        console.error('Payment processing failed', error);

        // Use fallback: queue for retry
        await queuePaymentForRetry(orderId, amount);

        return {
            status: 'pending',
            message: 'Payment queued for processing'
        };
    }
}

This layered approach provides defense in depth. The bulkhead prevents payment failures from consuming all resources. The circuit breaker stops calling a failing payment API. Retries handle transient errors. Timeouts prevent hanging. The fallback queues payments when all else fails.
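Layering also multiplies worst-case latency, which matters because each call holds its bulkhead slot for the whole retry sequence. The sketch below estimates that upper bound, assuming every attempt times out, timeouts are treated as retryable, and the backoff follows the schedule used in this article (base * 2^(attempt - 1) with up to 25% jitter). The helper name `worstCaseHoldMs` is hypothetical.

```javascript
// Worst-case time one call can hold a bulkhead slot when every attempt times out.
// Assumes backoff of baseDelay * 2^(attempt - 1) with up to `jitterRatio` extra, capped.
function worstCaseHoldMs({ timeout, maxAttempts, baseDelay, maxDelay, jitterRatio = 0.25 }) {
    let total = timeout * maxAttempts; // every attempt runs to its timeout
    for (let attempt = 1; attempt < maxAttempts; attempt++) {
        const backoff = baseDelay * 2 ** (attempt - 1) * (1 + jitterRatio);
        total += Math.min(backoff, maxDelay); // wait between attempts
    }
    return total;
}

// Settings taken from the paymentClient and searchClient examples
const paymentWorstCase = worstCaseHoldMs({
    timeout: 10000, maxAttempts: 3, baseDelay: 1000, maxDelay: 30000
});
const searchWorstCase = worstCaseHoldMs({
    timeout: 3000, maxAttempts: 2, baseDelay: 1000, maxDelay: 30000
});

console.log({ paymentWorstCase, searchWorstCase });
```

For the payment settings this comes to 33,750 ms per call, so 10 slots can all be occupied for over half a minute before the circuit breaker sees enough failures to open. That is worth comparing against any upstream request timeout.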

Warning: Don't combine retries with circuit breakers naively. If the circuit breaker sits inside the retry loop, every attempt counts as a failure, so with 3 attempts per request the circuit opens roughly 3x faster than intended. Either raise the threshold to account for retries or count only final failures (after all retries are exhausted) toward the circuit breaker threshold. The ResilientClient above takes the second approach by running the retry policy inside the circuit breaker.
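The arithmetic behind this warning is simple enough to check directly. The sketch below counts how many failing user requests it takes to open the circuit depending on where the breaker sits; the function name and the placement labels are assumptions for illustration.

```javascript
// How many failing requests it takes to reach the breaker's failure threshold.
// 'inside-retry': the breaker records every attempt.
// 'outside-retry': the breaker records one outcome per request.
function requestsUntilOpen(failureThreshold, attemptsPerRequest, breakerPlacement) {
    const failuresPerRequest = breakerPlacement === 'inside-retry' ? attemptsPerRequest : 1;
    return Math.ceil(failureThreshold / failuresPerRequest);
}

console.log(requestsUntilOpen(5, 3, 'inside-retry'));
console.log(requestsUntilOpen(5, 3, 'outside-retry'));
```

With a threshold of 5 and 3 attempts per request, a breaker inside the retry loop opens during the second failing request, while a breaker outside it needs five failing requests.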

Monitoring and Observability

Fault tolerance patterns only work if you monitor them. You need visibility into circuit breaker state changes, retry attempts, bulkhead utilization, and fallback usage to understand your system's health and tune pattern parameters.

class MetricsCollector {
    constructor() {
        this.metrics = new Map();
    }

    increment(metric, tags = {}) {
        const key = this.buildKey(metric, tags);
        const current = this.metrics.get(key) || 0;
        this.metrics.set(key, current + 1);
    }

    gauge(metric, value, tags = {}) {
        const key = this.buildKey(metric, tags);
        this.metrics.set(key, value);
    }

    timing(metric, duration, tags = {}) {
        const key = this.buildKey(metric, tags);
        const existing = this.metrics.get(key) || [];
        existing.push(duration);
        this.metrics.set(key, existing);
    }

    buildKey(metric, tags) {
        const tagStr = Object.entries(tags)
            .map(([k, v]) => `${k}:${v}`)
            .sort()
            .join(',');
        return `${metric}{${tagStr}}`;
    }

    getMetrics() {
        return Object.fromEntries(this.metrics);
    }
}

const metrics = new MetricsCollector();

// Instrumented circuit breaker
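class InstrumentedCircuitBreaker extends CircuitBreaker {
    constructor(options = {}) {
        super(options);
        // The base class does not store a name, so keep one here for metric tags
        this.serviceName = options.serviceName || 'unknown';
    }
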
    onSuccess() {
        super.onSuccess();
        metrics.increment('circuit_breaker.success', {
            service: this.serviceName,
            state: this.state
        });
    }

    onFailure() {
        const previousState = this.state;
        super.onFailure();

        metrics.increment('circuit_breaker.failure', {
            service: this.serviceName,
            state: previousState
        });

        if (this.state === 'OPEN' && previousState !== 'OPEN') {
            metrics.increment('circuit_breaker.opened', {
                service: this.serviceName
            });
            console.error(`ALERT: Circuit breaker opened for ${this.serviceName}`);
        }
    }
}

// Instrumented retry policy
class InstrumentedRetryPolicy extends RetryPolicy {
    async execute(operation, context = {}) {
        const startTime = Date.now();
        let attempts = 0;

        // Count every invocation so the metrics reflect the real number of attempts
        const countedOperation = () => {
            attempts++;
            return operation();
        };

        try {
            const result = await super.execute(countedOperation, context);

            metrics.timing('retry.duration', Date.now() - startTime, {
                service: context.service,
                attempts: attempts.toString()
            });

            if (attempts > 1) {
                metrics.increment('retry.success_after_retry', {
                    service: context.service,
                    attempts: attempts.toString()
                });
            }

            return result;
        } catch (error) {
            metrics.increment('retry.exhausted', {
                service: context.service,
                attempts: attempts.toString()
            });
            throw error;
        }
    }
}

// Dashboard endpoint
app.get('/metrics', (req, res) => {
    res.json({
        metrics: metrics.getMetrics(),
        bulkheads: {
            payment: paymentBulkhead.getMetrics(),
            search: searchBulkhead.getMetrics()
        },
        circuits: {
            payment: paymentBreaker.getState(),
            search: searchBreaker.getState()
        }
    });
});

Key metrics to monitor: circuit breaker state changes (openings indicate service degradation), retry success rates (high retry rates indicate reliability issues), bulkhead utilization (near 100% indicates capacity issues), and fallback usage (indicates primary paths are failing).

FAQ

Should I implement fault tolerance patterns for every API call?

No. Start with external services and known-unreliable dependencies. Internal database calls in a well-functioning system usually don't need circuit breakers. Third-party APIs, payment processors, and services you don't control definitely need fault tolerance. Add patterns when you experience actual failures, not preemptively everywhere.

How do I set the right timeout values?

Monitor your service latency in production and set timeouts at p99 or p99.9 latency plus a buffer. For example, if p99 latency is 2 seconds, a 5-second timeout is a reasonable starting point. This allows slow-but-legitimate requests to complete while preventing truly stuck requests from holding resources. Adjust based on observed timeout rates: if around 1% of requests time out while the service is healthy, your timeout is too aggressive.
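
The percentile-plus-buffer rule can be sketched as a small calculation. This is a minimal illustration, not a tuning recommendation: `percentile`, `suggestTimeoutMs`, and the 2.5x buffer multiplier are hypothetical names and values chosen to reproduce the 2-second to 5-second example above.

```javascript
// Nearest-rank percentile: the smallest sample that has at least p% of all
// samples at or below it.
function percentile(samples, p) {
    if (samples.length === 0) {
        throw new Error('percentile() needs at least one sample');
    }
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical helper: timeout = chosen percentile latency times a buffer multiplier.
function suggestTimeoutMs(latenciesMs, { p = 99, bufferMultiplier = 2.5 } = {}) {
    return Math.ceil(percentile(latenciesMs, p) * bufferMultiplier);
}

// Illustrative data: 98 fast requests, one slow request, one extreme outlier.
const observedLatenciesMs = [...Array(98).fill(100), 2000, 9000];

console.log(percentile(observedLatenciesMs, 99)); // 2000
console.log(suggestTimeoutMs(observedLatenciesMs)); // 5000
```

In practice the samples would come from your monitoring system rather than an in-memory array, and the multiplier is something to adjust as you watch real timeout rates.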

When should I use circuit breakers versus retries?

Use both together. Retries handle transient failures (brief network blips, temporary overload). Circuit breakers handle persistent failures (service down for minutes). Configure the circuit breaker to count a failure only after a request has exhausted its retries, not on each individual attempt. Once the breaker opens, retries stop, which prevents retry storms, while requests that hit a transient failure can still succeed on a later attempt.
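
One way to layer the two is to wrap the breaker around the whole retry loop. The sketch below assumes that arrangement. `SimpleBreaker`, `withRetries`, and `callWithResilience` are hypothetical stand-ins written for this example, not the `CircuitBreaker` and `RetryPolicy` classes from earlier in the article, and backoff between attempts is left out to keep it short.

```javascript
// Minimal stand-in breaker for illustration only.
class SimpleBreaker {
    constructor({ failureThreshold = 3, resetTimeoutMs = 30_000, now = Date.now } = {}) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMs = resetTimeoutMs;
        this.now = now;
        this.failures = 0;
        this.openedAt = null;
    }

    async execute(operation) {
        // While open, fail fast until the reset timeout has elapsed.
        if (this.openedAt !== null && this.now() - this.openedAt < this.resetTimeoutMs) {
            throw new Error('Circuit open: failing fast');
        }
        try {
            const result = await operation();
            this.failures = 0;
            this.openedAt = null;
            return result;
        } catch (error) {
            this.failures++;
            if (this.failures >= this.failureThreshold) {
                this.openedAt = this.now();
            }
            throw error;
        }
    }
}

// Bare retry loop without backoff, to keep the example focused on layering.
async function withRetries(operation, maxAttempts = 3) {
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
        }
    }
    throw lastError;
}

// The breaker sees one failure per request that exhausted its retries,
// not one failure per individual attempt.
function callWithResilience(breaker, operation) {
    return breaker.execute(() => withRetries(operation));
}
```

With `failureThreshold: 2`, two requests that each fail three attempts open the breaker, and a third request is rejected without touching the downstream service. This simplified breaker lets every caller through once the reset timeout elapses; a production breaker would usually limit that half-open phase to a single trial request.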

How do I prevent retry storms?

Use exponential backoff with jitter. Without jitter, all clients that failed at the same time retry at the same time, creating synchronized waves of retries. Jitter (random variation in retry delay) spreads retries across time. Also use circuit breakers — when a service is clearly down, stop retrying entirely.
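
As a rough sketch of what jitter looks like in code, the helper below uses the "full jitter" variant: each delay is a random value between zero and a capped exponential ceiling. The function names, the 100 ms base, and the 10-second cap are illustrative assumptions, and the injectable `random` parameter exists only to make the behavior testable.

```javascript
// Full jitter: pick a uniformly random delay between 0 and the capped
// exponential ceiling for this attempt (attempt 0 is the first retry).
function backoffDelayMs(attempt, { baseMs = 100, maxMs = 10_000, random = Math.random } = {}) {
    const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
    return Math.floor(random() * ceiling);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries the operation, sleeping a jittered delay between attempts.
async function retryWithJitter(operation, { maxAttempts = 4, ...backoff } = {}) {
    let lastError;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
            if (attempt < maxAttempts - 1) {
                await sleep(backoffDelayMs(attempt, backoff));
            }
        }
    }
    throw lastError;
}

// With a fixed "random" value the delays are predictable:
console.log(backoffDelayMs(3, { random: () => 0.5 })); // 400 (half of 800)
console.log(backoffDelayMs(20, { random: () => 0.5 })); // 5000 (half of the 10 s cap)
```

Because every client draws its own random delay, clients that failed in the same instant no longer retry in the same instant.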

What's the difference between timeouts and circuit breakers?

Timeouts prevent individual requests from hanging indefinitely. Circuit breakers prevent making requests to services that are known to be failing. Use timeouts on every network call. Use circuit breakers around services that have experienced repeated failures. A timeout is per-request; a circuit breaker is per-service based on recent history.

How many bulkhead pools should I create?

Create separate pools for services with different SLAs or reliability characteristics. Don't create a pool for every single API endpoint — that's over-engineering. Common pattern: one pool for payment processing (critical, low volume), one for search (high volume, less critical), one for analytics (low priority). Typically 3-5 pools for most applications.

Should I retry POST requests that create resources?

Only with idempotency keys. Include a unique idempotency key in the request (usually a UUID). The server stores processed idempotency keys and returns the same result if it receives the same key again. This makes POST requests safely retryable. Without idempotency keys, retrying POST requests can create duplicate resources.
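
The server-side half of this can be sketched in a few lines. Everything here is a simplified assumption: `createOrder` is a hypothetical handler, and the in-memory `Map` stands in for what would normally be a shared, durable store with expiry and an atomic check-and-insert so that concurrent duplicates are also caught.

```javascript
const { randomUUID } = require('node:crypto');

// In-memory store for illustration only.
const processed = new Map();
let nextOrderId = 1;

// Hypothetical handler: creates the order once per idempotency key and
// returns the stored result for any repeat of the same key.
function createOrder(idempotencyKey, payload) {
    if (processed.has(idempotencyKey)) {
        return processed.get(idempotencyKey);
    }
    const order = { id: nextOrderId++, ...payload };
    processed.set(idempotencyKey, order);
    return order;
}

// The client generates the key once and reuses it on every retry.
const key = randomUUID();
const first = createOrder(key, { item: 'book' });
const retried = createOrder(key, { item: 'book' });

console.log(first.id === retried.id); // true: the retry did not create a duplicate
```

The important detail on the client side is that the key is generated before the first attempt and reused unchanged, rather than regenerated inside the retry loop.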

How do I test fault tolerance patterns?

Use chaos engineering tools to inject failures in test environments. Simulate slow responses, network timeouts, service failures. Verify circuit breakers open, retries work correctly, and fallbacks activate. Also write unit tests that verify pattern behavior — circuit breaker opens after N failures, retries use exponential backoff, timeouts trigger at the right time.
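
For the unit-test side, a small fault-injection helper is often enough. The sketch below is one possible shape: `failFirst` and `withLatency` are hypothetical test helpers, not part of any library, and `exampleTest` only demonstrates the helper itself using Node's built-in `assert` module.

```javascript
// Hypothetical test helper: wraps a function so its first `failures` calls reject.
function failFirst(failures, fn, error = new Error('injected failure')) {
    let calls = 0;
    const wrapped = async (...args) => {
        calls++;
        if (calls <= failures) {
            throw error;
        }
        return fn(...args);
    };
    wrapped.callCount = () => calls;
    return wrapped;
}

// Hypothetical test helper: adds fixed latency before calling through.
function withLatency(ms, fn) {
    return async (...args) => {
        await new Promise((resolve) => setTimeout(resolve, ms));
        return fn(...args);
    };
}

// Example check: the wrapped function fails twice, then succeeds.
async function exampleTest() {
    const assert = require('node:assert/strict');
    const flaky = failFirst(2, async () => 'ok');

    await assert.rejects(flaky(), /injected failure/);
    await assert.rejects(flaky(), /injected failure/);
    assert.equal(await flaky(), 'ok');
    assert.equal(flaky.callCount(), 3);
}
```

In a real test you would pass the `flaky` function to the retry policy or circuit breaker under test and assert on the outcome, for example that a policy allowing three attempts succeeds while one allowing two does not. `withLatency` plays the same role for timeout tests.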

What happens when a circuit breaker opens during high traffic?

All requests to that service immediately fail, which is often preferable to slowly timing out. Use fallbacks to provide degraded functionality. Monitor circuit breaker openings closely — an open circuit during high traffic means you're losing functionality exactly when users need it most. This should trigger immediate investigation and possibly manual intervention.

How do bulkheads interact with autoscaling?

Bulkhead limits are per-instance. If you have 3 instances each with a bulkhead limit of 10, your total capacity is 30 concurrent requests. When autoscaling adds instances, capacity increases proportionally. Set bulkhead limits based on per-instance capacity (CPU, memory, connection pools), not total system capacity. The autoscaler adjusts instance count; bulkheads ensure each instance doesn't exceed its resource limits.

Conclusion

Fault tolerance patterns — circuit breakers, retries with exponential backoff, bulkheads, timeouts, and fallbacks — work together to create systems that continue operating when dependencies fail. The patterns are most effective when layered: bulkheads isolate resource pools, circuit breakers prevent calling known-failing services, retries handle transient errors, timeouts prevent resource exhaustion, and fallbacks provide degraded functionality.

The key to successful fault tolerance is monitoring and tuning. Pattern parameters (timeout durations, retry counts, bulkhead sizes, circuit breaker thresholds) need to match your system's actual behavior and constraints. Start with conservative defaults, instrument thoroughly, and adjust based on observed failure patterns. Patterns that aren't monitored and tuned become noise that obscures real problems rather than preventing them.

Implementing fault tolerance is not about preventing all failures — failures are inevitable in distributed systems. It's about containing failures to prevent them from cascading, recovering automatically when possible, and degrading gracefully when recovery isn't possible. Systems built with these patterns in mind fail in predictable, controllable ways rather than catastrophically.

