Webhook Retry Logic: Handling Failures Gracefully

Webhooks are inherently unreliable. Not because the concept is flawed, but because they depend on two independent systems communicating over a network. Your server might be deploying, a database connection might time out, or a DNS resolution might fail. Any of those situations means a webhook delivery returns a non-2xx status code (or no response at all), and the event is lost unless the sender retries.

This is why webhook retry logic matters. If you are sending webhooks, you need a retry strategy that recovers from transient failures without overwhelming the receiver. If you are receiving webhooks, you need an endpoint that handles duplicate deliveries safely and degrades gracefully under load.

This article covers both sides. We will look at how Tolinku's webhook system handles retry logic on the sender side, then walk through patterns for building resilient receivers on the consumer side. If you are new to webhooks in the context of deep linking, start with the webhook setup guide or the broader webhooks and integrations overview.

Figure: Tolinku webhook configuration for event notifications. The webhooks page with create form, webhook list, and delivery log.

Why Webhooks Fail

Before discussing retry strategies, it is worth understanding the failure modes. Webhook deliveries fail for predictable reasons:

Receiver downtime. The target server is restarting, deploying, or experiencing an outage.
Timeouts. The receiver takes too long to respond. Most webhook senders (including Tolinku) enforce a timeout window.
Rate limiting. The receiver returns a 429 status code because it is receiving too many requests.
Application errors. A bug in the receiver code throws a 500 error.
Network issues. DNS failures, TLS handshake errors, or connection resets.

Some of these are transient (a deploy takes 30 seconds), and some are persistent (a misconfigured endpoint URL). Good retry logic distinguishes between these cases. Retrying a transient failure is recovery; retrying a persistent failure is wasted resources.

The HTTP specification (RFC 9110) defines status code semantics that inform retry decisions. A 503 (Service Unavailable) suggests retrying. A 400 (Bad Request) does not, because sending the same payload will produce the same result.
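The status-code reasoning above can be sketched as a small decision function. This is a minimal illustration, not Tolinku's documented behavior: the name `shouldRetry` and the exact grouping of codes are assumptions chosen for the example.

```javascript
// Hypothetical helper: decide whether a failed delivery is worth retrying.
// The grouping below is an illustrative policy, not a documented Tolinku rule.
function shouldRetry(statusCode) {
  // No response at all (timeout, DNS failure, connection reset): treat as transient.
  if (statusCode === null || statusCode === undefined) return true;

  // 408 (Request Timeout) and 429 (Too Many Requests) are 4xx codes
  // that nonetheless describe temporary conditions.
  if (statusCode === 408 || statusCode === 429) return true;

  // 5xx codes indicate server-side problems, which are often transient.
  if (statusCode >= 500 && statusCode <= 599) return true;

  // Everything else (2xx success, 3xx, remaining 4xx) is not retried:
  // sending the same payload again would produce the same result.
  return false;
}
```

A sender might call `shouldRetry(response.status)` after each attempt and pass `null` when the request never produced a response.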

How Tolinku Retries Failed Deliveries

When you register a webhook endpoint in your Tolinku Appspace, the platform begins delivering events for every subscribed event type (link.clicked, deferred_link.claimed, install.tracked, referral.created, referral.completed). Each delivery attempt expects a 2xx response within the timeout window.

If the delivery fails, Tolinku retries with exponential backoff. The retry schedule looks like this:

Attempt	Delay After Previous
1 (initial)	Immediate
2	~1 minute
3	~5 minutes
4	~30 minutes
5	~2 hours
6	~8 hours
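
Summing the delays in the schedule gives the approximate length of the whole retry window, which is useful when sizing anything that must outlive it (such as a deduplication record). The sketch below only adds up the listed values; because the schedule is approximate, so is the total of roughly 10.6 hours.

```javascript
// Delays between attempts, in minutes, taken from the schedule above.
// The schedule marks every value as approximate ("~"), so the total is too.
const retryDelaysMinutes = [1, 5, 30, 120, 480];

const totalMinutes = retryDelaysMinutes.reduce((sum, delay) => sum + delay, 0);
const totalHours = totalMinutes / 60;

console.log(
  `Final attempt lands ~${totalMinutes} minutes (~${totalHours.toFixed(1)} hours) after the first`
);
```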

Each retry includes the same payload, the same X-Webhook-Signature header (HMAC-SHA256 of the payload using your webhook secret), and the same X-Webhook-Event header indicating the event type. The event envelope stays consistent:

{
  "event": "link.clicked",
  "timestamp": "2026-05-19T14:32:00Z",
  "data": {
    "prefix": "promo",
    "token": "summer-sale",
    "hostname": "links.example.com",
    "ip": "203.0.113.42",
    "platform": "ios",
    "device_type": "mobile",
    "campaign": "summer2026"
  }
}

The payload is identical across retries, which means the X-Webhook-Signature will also be identical. This is important for idempotency (covered below).

You can monitor delivery status and inspect individual attempts in the Tolinku dashboard. See Testing and Deliveries for details on viewing delivery logs and replaying failed events.

Exponential Backoff with Jitter

Exponential backoff is the standard retry pattern for distributed systems. The idea is straightforward: wait longer between each retry attempt. The formula is typically:

delay = base_delay * 2^(attempt - 1)

A base delay of 1 second produces delays of 1s, 2s, 4s, 8s, 16s, and so on. This prevents a failing receiver from being hammered with rapid-fire retries, giving it time to recover.

However, pure exponential backoff suffers from the thundering herd problem. If a receiver goes down and 1,000 webhook deliveries fail simultaneously, all 1,000 will retry at exactly the same intervals, creating synchronized traffic spikes. The solution is to add jitter, a random component to the delay:

function getRetryDelay(attempt, baseDelayMs = 1000) {
  const exponentialDelay = baseDelayMs * Math.pow(2, attempt - 1);
  const jitter = Math.random() * exponentialDelay;
  return Math.min(exponentialDelay + jitter, 8 * 60 * 60 * 1000); // cap at 8 hours
}

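AWS published an influential analysis of exponential backoff and jitter strategies that compares full jitter, equal jitter, and decorrelated jitter. For most webhook use cases, full jitter (where the delay is random(0, exponentialDelay)) works well. Note that the getRetryDelay function above is not full jitter: it adds the random component on top of the exponential delay, so the result falls between exponentialDelay and 2 * exponentialDelay (before the cap).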
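
A full-jitter variant could be sketched as follows. The function name `getFullJitterDelay`, the `capMs` parameter, and the injectable `randomFn` are assumptions made for this example; the 8-hour cap mirrors the one used earlier in this section.

```javascript
// Full jitter: pick a uniformly random delay in [0, min(cap, base * 2^(attempt - 1))).
// `randomFn` is injectable only so the function can be tested deterministically.
function getFullJitterDelay(
  attempt,
  baseDelayMs = 1000,
  capMs = 8 * 60 * 60 * 1000,
  randomFn = Math.random
) {
  const exponentialDelay = Math.min(capMs, baseDelayMs * Math.pow(2, attempt - 1));
  return Math.floor(randomFn() * exponentialDelay);
}
```

Compared with adding jitter on top of the exponential delay, this spreads retries across the whole interval, at the cost of occasionally retrying very soon after a failure.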

Building Idempotent Receivers

Because retries mean your endpoint will receive the same event more than once, your receiver must be idempotent: processing the same event twice should produce the same result as processing it once.

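The simplest approach is to track the events you have already processed. The event envelope shown earlier carries no unique event ID, so this example keys on a hash of the raw payload instead: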

const express = require('express');
const crypto = require('crypto');
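const app = express();

// In-memory record of processed payload hashes (illustration only; lost on restart)
const processedPayloads = new Set();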

app.use('/webhooks', express.raw({ type: 'application/json' }));

app.post('/webhooks/tolinku', (req, res) => {
  const signature = req.headers['x-webhook-signature'];

  // Verify the signature over the raw body
  const expected = crypto
    .createHmac('sha256', process.env.TOLINKU_WEBHOOK_SECRET)
    .update(req.body)
    .digest('hex');

  const sig = Buffer.from(signature || '', 'hex');
  const exp = Buffer.from(expected, 'hex');

  if (sig.length !== exp.length || !crypto.timingSafeEqual(sig, exp)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const payload = JSON.parse(req.body.toString());

  // Hash the payload for deduplication (Tolinku payloads are identical on retry)
  const payloadHash = crypto.createHash('sha256').update(req.body).digest('hex');

  if (processedPayloads.has(payloadHash)) {
    return res.status(200).json({ status: 'already_processed' });
  }

  processedPayloads.add(payloadHash);

  // Process the event
  handleEvent(payload);

  res.status(200).json({ status: 'ok' });
});

Since Tolinku retries with the exact same payload, the raw body bytes are identical across retries, making a SHA-256 hash of the body an effective deduplication key. In production, use Redis SET with the NX option and a TTL:

const crypto = require('crypto');
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

async function tryClaimPayload(rawBody) {
  const hash = crypto.createHash('sha256').update(rawBody).digest('hex');
  // Returns 'OK' only if the key did not already exist
  const result = await redis.set(
    `webhook:processed:${hash}`,
    '1',
    'EX', 259200, // 72h TTL covers the full retry window
    'NX'
  );
  return result === 'OK';
}

Using SET with NX (set only if the key does not already exist) eliminates the race condition where two retries arrive nearly simultaneously and both pass the duplicate check before either marks the payload as processed.
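
One way to wire an atomic claim into a handler is sketched below. The names `createMemoryClaim` and `handleOnce` are hypothetical, and the in-memory claim merely stands in for a Redis-backed one so the example runs without external services.

```javascript
const crypto = require('crypto');

// In-memory stand-in for a Redis-backed claim, for illustration and tests only.
// The check-and-add is safe here because it runs synchronously in one process;
// across several processes you would need a shared store such as Redis.
function createMemoryClaim() {
  const seen = new Set();
  return async function claim(rawBody) {
    const hash = crypto.createHash('sha256').update(rawBody).digest('hex');
    if (seen.has(hash)) return false;
    seen.add(hash);
    return true;
  };
}

// Hypothetical wrapper: run `processEvent` only for the first delivery of a payload.
async function handleOnce(rawBody, claim, processEvent) {
  const isFirstDelivery = await claim(rawBody);
  if (!isFirstDelivery) {
    return 'already_processed';
  }
  await processEvent(JSON.parse(rawBody.toString()));
  return 'ok';
}
```

Claiming before processing keeps duplicates out, but it also means a crash in the middle of processing leaves the payload marked as handled. Whether to release the claim on failure is a design decision that depends on how costly a missed event is compared with a repeated one.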

Respond First, Process Later

A common mistake is doing all the work inside the webhook handler before responding. If your handler takes 10 seconds to update a database, call three APIs, and send an email, the webhook sender will likely time out and schedule a retry, even though the work completed.

The better pattern is to acknowledge immediately and process asynchronously:

const { Queue } = require('bullmq');

const webhookQueue = new Queue('webhook-processing', {
  connection: { host: 'localhost', port: 6379 }
});

app.post('/webhooks/tolinku', express.raw({ type: 'application/json' }), async (req, res) => {
  const signature = req.headers['x-webhook-signature'];

  // Verify signature (fast operation)
  if (!verifySignature(req.body, signature)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const payload = JSON.parse(req.body.toString());
  const payloadHash = crypto.createHash('sha256').update(req.body).digest('hex');

  // Enqueue for async processing
  await webhookQueue.add('process-event', payload, {
    jobId: payloadHash, // BullMQ deduplicates by jobId
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 }
  });

  // Respond immediately
  res.status(200).json({ status: 'queued' });
});

BullMQ is a popular choice for Node.js job queues backed by Redis. It handles its own retry logic for the processing step, and using the payload hash as the jobId gives you built-in deduplication. Other options include pg-boss (backed by PostgreSQL) and cloud-native alternatives like AWS SQS.

This separation of concerns, accepting the webhook and processing the webhook, solves two problems at once: the sender gets a fast 200 response, and your processing logic can take as long as it needs.
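
The handler above calls a verifySignature helper that is not shown. A minimal sketch follows, assuming the signature is a hex-encoded HMAC-SHA256 of the raw body, as in the earlier receiver example; the optional `secret` parameter is an addition for testability.

```javascript
const crypto = require('crypto');

// Sketch of the verifySignature helper assumed by the handler above.
// Assumes a hex-encoded HMAC-SHA256 signature over the raw request body.
function verifySignature(rawBody, signature, secret = process.env.TOLINKU_WEBHOOK_SECRET) {
  if (typeof signature !== 'string' || !secret) return false;

  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signature, 'hex');

  // timingSafeEqual throws when the lengths differ, so compare lengths first.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}
```

Returning false for a missing header or an unset secret keeps the handler's single `if (!verifySignature(...))` check sufficient for rejecting unsigned requests.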

Dead Letter Queues

Sometimes retries are not enough. After exhausting every attempt, an event delivery may still fail. Rather than being discarded, the event needs to go somewhere. In message queue terminology, that somewhere is a dead letter queue (DLQ).

A dead letter queue collects events that could not be processed after all retry attempts. The purpose is not to retry them automatically (that already failed) but to make them visible for investigation and manual replay.

On the sender side, Tolinku marks deliveries that exhaust all retry attempts as "failed" in the dashboard. You can inspect the payload and response, identify the issue, and replay the delivery. On the receiver side, if you are using BullMQ:

const { Worker } = require('bullmq');

const worker = new Worker('webhook-processing', async (job) => {
  const event = job.data;

  // Your processing logic
  await processWebhookEvent(event);
}, {
  connection: { host: 'localhost', port: 6379 }
});

worker.on('failed', (job, err) => {
  // job can be undefined if it was removed before the event fired
  if (job && job.attemptsMade >= (job.opts.attempts || 1)) {
    // All retries exhausted; log for manual review (job.id is the payload hash)
    console.error(`Dead letter: job ${job.id}`, err.message);
    // Move to a dead letter queue or alert the team (notifyOpsTeam is your own helper)
    notifyOpsTeam(job.data, err);
  }
});

The key insight is that dead letter queues transform silent data loss into visible, actionable alerts. When an event ends up in the DLQ, someone on your team gets notified, investigates the root cause, and replays the event once the issue is fixed.
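
Moving an exhausted job into a separate queue could look like the sketch below. The helper name `moveToDeadLetter` and the record fields are assumptions; `deadLetterQueue` is anything exposing an async `add(name, data)` method, which a BullMQ `Queue` instance does.

```javascript
// Hypothetical helper: copy an exhausted job into a separate queue for manual review.
// `deadLetterQueue` only needs an async add(name, data) method, so tests can
// pass a plain stub instead of a Redis-backed queue.
async function moveToDeadLetter(deadLetterQueue, job, err) {
  const maxAttempts = (job.opts && job.opts.attempts) || 1;
  if (job.attemptsMade < maxAttempts) {
    return false; // the processing queue will still retry this job
  }

  await deadLetterQueue.add('dead-letter', {
    originalJobId: job.id,
    payload: job.data,
    error: err.message,
    attemptsMade: job.attemptsMade,
    failedAt: new Date().toISOString(),
  });
  return true;
}
```

Storing the original payload alongside the error message means a replay needs nothing beyond the dead letter record itself.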

Circuit Breakers

If a receiver is completely down, continuing to send retries is wasteful for both sides. A circuit breaker temporarily stops sending requests to a failing endpoint, then periodically checks whether it has recovered.

The circuit breaker pattern (popularized by Michael Nygard in Release It!) has three states:

Closed (normal operation): requests flow through. If failures exceed a threshold, the circuit opens.
Open: requests are blocked. After a timeout period, the circuit moves to half-open.
Half-open: a single test request is sent. If it succeeds, the circuit closes. If it fails, the circuit opens again.

For webhook senders, a circuit breaker prevents a downed receiver from consuming retry resources. For webhook receivers that call downstream services, a circuit breaker prevents cascading failures:

class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 60000;
    this.state = 'CLOSED';
    this.failures = 0;
    this.lastFailureTime = null;
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}

Libraries like opossum provide production-ready circuit breaker implementations for Node.js with built-in metrics, fallback functions, and event emitters.

Retry Budgets

Exponential backoff and circuit breakers help manage retries, but they do not address the total cost of retrying. A retry budget sets a cap on how much retry traffic is allowed as a proportion of total traffic.

Google's Site Reliability Engineering book describes retry budgets as a percentage of normal requests. For example, a 10% retry budget means that for every 100 normal requests, you allow at most 10 retries. If the retry rate exceeds the budget, new retries are suppressed until the rate drops.

This is especially relevant for webhook systems with high throughput. If your Appspace processes thousands of link.clicked events per hour and a downstream service goes down, unconstrained retries could double or triple your outbound traffic. A retry budget keeps the total load predictable:

class RetryBudget {
  constructor(budgetPercent = 10, windowMs = 60000) {
    this.budgetPercent = budgetPercent;
    this.windowMs = windowMs;
    this.requests = [];
    this.retries = [];
  }

  recordRequest() {
    this.requests.push(Date.now());
    this.cleanup();
  }

  canRetry() {
    this.cleanup();
    const maxRetries = Math.ceil(
      this.requests.length * (this.budgetPercent / 100)
    );
    return this.retries.length < maxRetries;
  }

  recordRetry() {
    this.retries.push(Date.now());
  }

  cleanup() {
    const cutoff = Date.now() - this.windowMs;
    this.requests = this.requests.filter(t => t > cutoff);
    this.retries = this.retries.filter(t => t > cutoff);
  }
}

Putting It All Together

A resilient webhook receiver combines all of these patterns. Here is a summary of the principles:

Verify the signature. Always validate the X-Webhook-Signature header before processing. This prevents spoofed events and confirms the payload has not been tampered with. See webhook rate limiting for additional protection against abuse.

Respond with 200 immediately. Acknowledge receipt before doing heavy processing. Enqueue the event for asynchronous handling.

Deduplicate using a payload hash. Use SET with NX in Redis (or an equivalent atomic operation) so that repeated deliveries of the same event are recognized and skipped, even when retries arrive concurrently.

Process asynchronously with its own retries. Use a job queue (BullMQ, pg-boss, SQS) that has its own retry logic for the processing step.

Route exhausted events to a dead letter queue. Never silently discard events. Failed events should be visible, inspectable, and replayable.

Use circuit breakers for downstream calls. If your webhook handler calls external APIs, wrap those calls in a circuit breaker to prevent cascading failures.

Set retry budgets for high-throughput systems. Cap retry traffic as a percentage of normal traffic to prevent retry storms.

Webhook retry logic is not glamorous work, but it is the difference between a system that silently loses data and one that recovers gracefully from every failure. Build it once, and you will never wonder whether a link.clicked event made it to your analytics warehouse.
