Problem Description
I was experiencing multiple instability issues while using Evolution API integrated with:
- Custom Web Application
- Chatwoot
- N8n
The main symptoms were:
- Message and upsert failures
- QR code inconsistencies
- Instance update instability
- Random disconnections
- General API slowdown
- Temporary freezing or unresponsiveness
After investigation, I identified that the root cause was excessive and uncontrolled retry behavior combined with high-frequency polling against the Evolution API.
Specific Case: Chatwoot Retry Loop
When Chatwoot fails to connect an inbox/instance (for example, WhatsApp via Evolution), it enters automatic retry mode.
This retry behavior:
- Continuously attempts reconnection
- Accumulates background jobs
- Generates repeated API calls
- Creates excessive polling
- Floods the Evolution API with requests
Over time, this results in:
- High inbound traffic
- Redis job backlog accumulation
- API instability
- Failed message processing
- QR code failures
- Instance update inconsistencies
In extreme cases, the API becomes partially unresponsive.
Current Workaround (Temporary Fix)
To restore stability when overload occurs, the only working solution has been:
- Restart Evolution API
- Restart Redis
- In some cases, manually clear Redis job queues
This clears the backlog and immediately stabilizes the system.
However, this is a reactive fix, not a preventive solution.
Root Cause
The Evolution API currently appears to lack internal protections against:
- Excessive retry loops
- High-frequency polling
- Uncontrolled client-side reconnection attempts
- Abusive or broken request patterns
A single misconfigured client can overload the entire API.
Suggested Improvements
1) Built-in Rate Limiting
Implement rate limiting based on:
- API key
- Instance
- IP address
Return proper HTTP 429 (Too Many Requests) responses when limits are exceeded.
This would prevent a single integration from destabilizing the entire system.
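As a rough illustration of the idea (not Evolution API code), a sliding-window limiter keyed by API key, instance, or IP could look like this; the class name, limits, and window size are all hypothetical:

```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    """Hypothetical sketch: allow at most `max_requests` per `window` seconds per key."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key, now=None):
        """Return True if the request may proceed, False if it should get HTTP 429."""
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Evict timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # caller responds with 429 Too Many Requests
        q.append(now)
        return True
```

A misbehaving Chatwoot retry loop would then exhaust its own quota and receive 429s instead of degrading the whole API.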
2) Retry Loop Protection / Anti-Spam Safeguards
- Detect repeated identical requests within short timeframes
- Temporarily block abusive request patterns
- Implement circuit breaker behavior for failing instances
- Add request throttling mechanisms
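The circuit-breaker part could be sketched roughly as follows; the class, thresholds, and timeout are illustrative assumptions, not existing Evolution API internals:

```python
import time

class CircuitBreaker:
    """Sketch: after `max_failures` consecutive failures, reject calls for an
    instance until `reset_timeout` seconds pass, then allow a trial call."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed (healthy)

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.reset_timeout:
            return True  # half-open: let one trial request through
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now  # open the breaker
```

With one breaker per instance, a failing inbox stops hammering the API while healthy instances keep working.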
3) Controlled Webhook Queue System
Improve webhook reliability and safety:
- Limit outbound webhook request rate
- Implement exponential backoff
- Add dead-letter queue for failed events
- Temporarily disable webhook endpoints that repeatedly fail
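The backoff and dead-letter behavior above can be sketched as follows; function names, delays, and attempt counts are assumptions for illustration:

```python
import random

def backoff_delay(attempt, base_delay=1.0, factor=2.0, max_delay=60.0, jitter=False):
    """Delay in seconds before retry number `attempt` (1-based), capped at max_delay."""
    delay = min(base_delay * factor ** (attempt - 1), max_delay)
    if jitter:
        delay *= random.uniform(0.5, 1.0)  # spread retries so they don't burst together
    return delay

def deliver_with_retries(send, event, max_attempts=5, dead_letter=None):
    """Call `send(event)` up to max_attempts times; park the event in a
    dead-letter list if every attempt fails, instead of retrying forever."""
    for attempt in range(1, max_attempts + 1):
        if send(event):
            return True
        # In a real queue worker the delay would be a scheduled re-enqueue
        # after backoff_delay(attempt) seconds, not a blocking sleep.
    if dead_letter is not None:
        dead_letter.append(event)
    return False
```

Capping attempts and parking failures in a dead-letter queue is exactly what prevents the unbounded Redis job backlog described earlier.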
4) Architectural Recommendation (Most Important)
The ideal architecture should minimize client-side polling and rely on an event-driven model.
Instead of clients constantly querying the API for:
- Instance status
- Conversations
- Messages
- QR codes
the system should instead prioritize:
- Evolution pushing all updates via webhook
- Clients only consuming events
- API used only for explicit actions (send message, create instance, etc.)
In other words:
- Avoid constant API polling
- Avoid status-check loops
- Prefer push-based updates over pull-based synchronization
This would significantly reduce server load and improve scalability.
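A minimal sketch of the consumer side of this model: the client keeps a local cache that webhook events update, so status reads never hit the API. The event names below (`connection.update`, `qrcode.updated`) are illustrative placeholders; the real names come from the Evolution API webhook documentation:

```python
class InstanceStateCache:
    """Sketch of a push-based client: webhook events update local state,
    so 'what is the instance status?' is answered without polling."""

    def __init__(self):
        self.status = "unknown"
        self.qr_code = None

    def handle_event(self, event):
        kind = event.get("event")
        if kind == "connection.update":
            self.status = event["data"]["state"]
        elif kind == "qrcode.updated":
            self.qr_code = event["data"]["qrcode"]
        # Unknown events are ignored rather than triggering a status poll.

    def current_status(self):
        # Served from local state: no API call, no polling loop.
        return self.status
```

The API is then only called for explicit actions (send message, create instance), which is the load profile the architecture above aims for.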
Conclusion
The instability was not caused by a bug in message processing itself, but by uncontrolled retry and polling behavior that overloaded the API and Redis job queues.
Implementing rate limiting, retry protection, and stronger event-driven guidance in documentation would likely prevent many future incidents.
I am sharing this experience to help improve system resilience and avoid similar issues for other users.