Troubleshooting Excessive Retry Behavior Causing API Overload (Chatwoot / Polling / Redis Backlog) #2445

@DanielSerigiolle

Problem Description

I was experiencing multiple instability issues while using Evolution API integrated with:

  • Custom Web Application
  • Chatwoot
  • n8n

The main symptoms were:

  • Message and upsert failures
  • QR code inconsistencies
  • Instance update instability
  • Random disconnections
  • General API slowdown
  • Temporary freezing or unresponsiveness

After investigation, I identified that the root cause was excessive and uncontrolled retry behavior combined with high-frequency polling against the Evolution API.


Specific Case: Chatwoot Retry Loop

When Chatwoot fails to connect an inbox/instance (for example, WhatsApp via Evolution), it enters automatic retry mode.

This retry behavior:

  • Continuously attempts reconnection
  • Accumulates background jobs
  • Generates repeated API calls
  • Creates excessive polling
  • Floods the Evolution API with requests

Over time, this results in:

  • High inbound traffic
  • Redis job backlog accumulation
  • API instability
  • Failed message processing
  • QR code failures
  • Instance update inconsistencies

In extreme cases, the API becomes partially unresponsive.


Current Workaround (Temporary Fix)

To restore stability when overload occurs, the only working solution has been:

  1. Restart Evolution API
  2. Restart Redis
  3. In some cases, manually clear Redis job queues

This clears the backlog and immediately stabilizes the system.

However, this is a reactive fix, not a preventive solution.


Root Cause

The Evolution API currently appears to lack internal protections against:

  • Excessive retry loops
  • High-frequency polling
  • Uncontrolled client-side reconnection attempts
  • Abusive or broken request patterns

A single misconfigured client can overload the entire API.


Suggested Improvements

1) Built-in Rate Limiting

Implement rate limiting based on:

  • API key
  • Instance
  • IP address

Return proper HTTP 429 (Too Many Requests) responses when limits are exceeded, ideally with a Retry-After header so that well-behaved clients know when to try again.

This would prevent a single integration from destabilizing the entire system.
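
To make the idea concrete, here is a minimal sketch of per-key limiting with a token bucket. It is not Evolution API code: `TokenBucket`, `RateLimiter`, `check`, and the limits used are hypothetical names and values chosen for illustration, and the clock is injectable only so the behavior can be checked deterministically.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""
    capacity: float
    rate: float
    tokens: float
    updated: float

    def take(self, now: float) -> Tuple[bool, float]:
        # Refill according to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Seconds until one full token is available again.
        return False, (1.0 - self.tokens) / self.rate


class RateLimiter:
    """Keeps one bucket per key (hypothetically: API key, instance name, or client IP)."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity, self.rate, self.clock = capacity, rate, clock
        self.buckets: Dict[str, TokenBucket] = {}

    def check(self, key: str) -> Tuple[int, Dict[str, str]]:
        """Return (HTTP status, extra response headers) for a request from `key`."""
        now = self.clock()
        bucket = self.buckets.setdefault(
            key, TokenBucket(self.capacity, self.rate, self.capacity, now)
        )
        allowed, wait = bucket.take(now)
        if allowed:
            return 200, {}
        # Tell the client how long to wait, rounded up to whole seconds.
        return 429, {"Retry-After": str(max(1, math.ceil(wait)))}
```

Because each key has its own bucket, a client stuck in a retry loop exhausts only its own allowance, while requests under other keys continue to pass.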


2) Retry Loop Protection / Anti-Spam Safeguards

  • Detect repeated identical requests within short timeframes
  • Temporarily block abusive request patterns
  • Implement circuit breaker behavior for failing instances
  • Add request throttling mechanisms
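
The circuit breaker idea from the list above could look roughly like the sketch below, with one breaker kept per failing instance. The class name, thresholds, and method names are assumptions made for illustration and do not reflect anything that exists in Evolution API today.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed until `max_failures` consecutive failures, then open for `cooldown` seconds."""

    def __init__(self, max_failures: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a call may proceed right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let trial requests through again.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            # (Re)start the cooldown; a failed trial request reopens the circuit.
            self.opened_at = self.clock()
```

A caller would check `allow()` before attempting a reconnection and report the outcome with `record_success()` or `record_failure()`, so a broken instance is retried only once per cooldown window instead of continuously.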

3) Controlled Webhook Queue System

Improve webhook reliability and safety:

  • Limit outbound webhook request rate
  • Implement exponential backoff
  • Add dead-letter queue for failed events
  • Temporarily disable webhook endpoints that repeatedly fail
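
As an illustration of exponential backoff combined with a dead-letter queue, here is a small sketch. `backoff_delays`, `deliver`, and the `send` callback are hypothetical names, the in-memory deque stands in for whatever persistent queue a real implementation would use, and the delays and retry counts are arbitrary.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, List


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   jitter: Callable[[], float] = random.random) -> List[float]:
    """Delay before each retry: exponential growth, capped, scaled by random jitter."""
    return [min(cap, base * 2 ** n) * jitter() for n in range(attempts)]


def deliver(event: dict, send: Callable[[dict], bool], dead_letters: Deque[dict],
            max_retries: int = 5,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try once, then retry with backoff; park the event if every attempt fails."""
    if send(event):
        return True
    for delay in backoff_delays(max_retries):
        sleep(delay)
        if send(event):
            return True
    # Keep the failed event for inspection or manual replay instead of retrying forever.
    dead_letters.append(event)
    return False
```

The jitter spreads retries out so that many failing deliveries do not all hit the receiving endpoint at the same moment, and the bounded retry count is what keeps the backlog from growing without limit.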

4) Architectural Recommendation (Most Important)

The ideal architecture should minimize client-side polling and rely on an event-driven model.

Instead of clients constantly querying the API for:

  • Instance status
  • Conversations
  • Messages
  • QR codes

The system should prioritize:

  • Evolution pushing all updates via webhook
  • Clients only consuming events
  • API used only for explicit actions (send message, create instance, etc.)

In other words:

  • Avoid constant API polling
  • Avoid status-check loops
  • Prefer push-based updates over pull-based synchronization

This would significantly reduce server load and improve scalability.
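
On the client side, the push-based model can be sketched as a small cache that is updated only by incoming webhook events and answers status questions locally. The event names and payload fields used here (`event`, `instance`, `data`, `state`, `qrcode`) are assumptions for illustration and should be checked against the payloads your deployment actually sends; `InstanceStateCache` is a hypothetical helper, not part of any library.

```python
from typing import Any, Dict, Optional


class InstanceStateCache:
    """Holds the last known state per instance, updated only by webhook events."""

    def __init__(self) -> None:
        self.state: Dict[str, Dict[str, Any]] = {}

    def handle_event(self, payload: Dict[str, Any]) -> None:
        """Apply one webhook payload; unknown events are ignored."""
        instance = payload.get("instance")
        if not instance:
            return
        entry = self.state.setdefault(
            instance, {"connection": None, "qrcode": None, "messages": 0}
        )
        event = payload.get("event")
        data = payload.get("data") or {}
        if event == "connection.update":      # assumed event name
            entry["connection"] = data.get("state")
        elif event == "qrcode.updated":       # assumed event name
            entry["qrcode"] = data.get("qrcode")
        elif event == "messages.upsert":      # assumed event name
            entry["messages"] += 1

    def connection(self, instance: str) -> Optional[str]:
        """Answer a status question from the cache instead of calling the API."""
        return self.state.get(instance, {}).get("connection")
```

With this in place, the webhook handler calls `handle_event` for each incoming payload, and the UI reads from the cache, so no status-check loop against the API is needed.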


Conclusion

The instability was not caused by a bug in message processing itself, but by uncontrolled retry and polling behavior that overloaded the API and Redis job queues.

Implementing rate limiting, retry protection, and stronger event-driven guidance in documentation would likely prevent many future incidents.

I am sharing this experience to help improve system resilience and avoid similar issues for other users.
