How to handle rate limits when using multiple agents in parallel? #4078
Looking for an answer.
Replies: 5 comments
Had this exact problem when I built my multi-agent system. Here's what actually worked for me:

The easiest fix is to use different API keys for different agents, if you have them. If you're on a single key like I was, set `max_rpm` in your agent config: pass `max_rpm=20` or `max_rpm=30` as a parameter alongside `role` and `goal` in your `Agent` definition. This tells CrewAI to limit requests per minute, and it automatically paces calls so you don't hit OpenAI's limits.

If you still hit rate limits with `max_rpm` set, try switching from parallel to sequential processing. Yes, it's slower, but it's far more reliable; I now use parallel only for truly independent tasks.

One more thing that helped: if you're using GPT-4 for all agents, move some of them to GPT-4o-mini. It's cheaper and has higher rate limits, so save GPT-4 for just your critical-thinking agent.

The combination of `max_rpm` limiting, sequential processing, and mixing models basically solved all my rate-limit issues. Hope this helps!
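If it helps, here's roughly what that `max_rpm` setting looks like in practice. This is a sketch: the `role`, `goal`, and `backstory` values are placeholders of mine, not from the thread.

```python
from crewai import Agent

researcher = Agent(
    role="Researcher",                    # placeholder values, use your own
    goal="Gather sources on the topic",
    backstory="An experienced analyst.",
    max_rpm=20,  # CrewAI paces this agent to at most 20 requests per minute
)
```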
Rate limiting with parallel agents is tricky. Here are the patterns that work:

**1. Semaphore / token bucket**

```python
import asyncio
from asyncio import Semaphore

rate_limiter = Semaphore(5)  # max 5 concurrent calls

async def rate_limited_call(agent, task):
    async with rate_limiter:
        return await agent.execute(task)
```

**2. Per-provider limits**

**3. Exponential backoff with jitter**

```python
import random
import time

from openai import RateLimitError  # or your provider's equivalent

def backoff_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
    raise Exception("Max retries exceeded")
```

**4. Request queuing**

**5. Caching**

We run multi-agent systems at Revolution AI with mixed provider strategies: some agents on GPT-4o, others on Claude, which spreads the rate-limit budget. Works well for production workloads.

What provider are you hitting limits with?
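For the caching point, a minimal sketch: memoize responses by prompt hash so repeated identical prompts never hit the API a second time. The `cached_llm` and `fake_llm` names are mine for illustration, not from any library.

```python
import functools
import hashlib

def cached_llm(call_fn):
    """Cache responses by prompt hash so repeated prompts skip the API."""
    cache = {}

    @functools.wraps(call_fn)
    def wrapper(prompt):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in cache:
            cache[key] = call_fn(prompt)
        return cache[key]

    wrapper.cache = cache
    return wrapper

calls = []

@cached_llm
def fake_llm(prompt):
    calls.append(prompt)  # stands in for a real (rate-limited) API request
    return f"answer to: {prompt}"

fake_llm("summarize X")
fake_llm("summarize X")  # second call is served from the cache
```

Obvious caveat: only cache when identical prompts really should return identical answers (deterministic settings, no time-sensitive context).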
Rate limiting with parallel agents is a real challenge! At RevolutionAI (https://revolutionai.io), here is what works:

**Token bucket approach:**

```python
import asyncio
from asyncio import Semaphore

class RateLimiter:
    def __init__(self, rpm=60):
        self.semaphore = Semaphore(rpm)

    async def acquire(self):
        await self.semaphore.acquire()
        # Return the permit after 60s, so at most `rpm` calls start per minute
        asyncio.create_task(self._release_after(60))

    async def _release_after(self, delay):
        await asyncio.sleep(delay)
        self.semaphore.release()
```

**Per-provider strategies**

**Agent-level controls**

**Monitoring**

The key is making rate limiting transparent to agents: they should not need to know about it!
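One way to sketch the per-provider idea: a separate semaphore per provider, so a burst of calls against one provider cannot starve calls to another. Provider names and limits below are illustrative, not from the post.

```python
import asyncio

async def call_provider(limiters, provider, make_request):
    # Each provider gets its own concurrency cap.
    async with limiters[provider]:
        return await make_request()

async def main():
    # Illustrative caps; tune to each provider's actual limits.
    limiters = {
        "openai": asyncio.Semaphore(5),
        "anthropic": asyncio.Semaphore(3),
    }

    async def fake_request():
        await asyncio.sleep(0.01)  # stands in for a real API call
        return "ok"

    return await asyncio.gather(
        *(call_provider(limiters, "openai", fake_request) for _ in range(10))
    )

results = asyncio.run(main())
```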
Rate limiting with parallel agents needs coordination. Here are patterns that work:

**1. Shared rate limiter**

```python
from asyncio import Semaphore

# Global semaphore for API calls
api_semaphore = Semaphore(5)  # max 5 concurrent calls

class RateLimitedLLM:
    async def call(self, prompt):
        async with api_semaphore:
            return await self.llm.call(prompt)
```

**2. Token bucket per model**

```python
from aiolimiter import AsyncLimiter

# OpenAI: 10K TPM, ~3 requests/sec
openai_limiter = AsyncLimiter(3, 1)  # 3 per second

async def rate_limited_call(llm, prompt):
    await openai_limiter.acquire()
    return await llm.call(prompt)
```

**3. Sequential for rate-sensitive tasks**

```python
from crewai import Crew, Process

crew = Crew(
    agents=[agent1, agent2, agent3],
    tasks=[task1, task2, task3],
    process=Process.sequential,  # not parallel
)
```

**4. LiteLLM with built-in rate limiting**

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[...],
    max_retries=3,
    timeout=60,
    # LiteLLM handles 429s automatically
)
```

**5. Stagger agent starts**

```python
import asyncio

async def run_agents_staggered(agents, delay=2):
    tasks = []
    for agent in agents:
        tasks.append(asyncio.create_task(agent.run()))
        await asyncio.sleep(delay)  # stagger starts
    return await asyncio.gather(*tasks)  # keep task refs and wait for all
```

We run multi-agent systems at Revolution AI; token bucket + LiteLLM retry is the most reliable combo.
Rate limiting with parallel agents is critical! At RevolutionAI (https://revolutionai.io) we handle this:

**Solutions:**

```python
from asyncio import Semaphore

rate_limiter = Semaphore(5)  # 5 concurrent calls

async def rate_limited_call(agent, task):
    async with rate_limiter:
        return await agent.execute(task)
```

```python
from litellm import Router

router = Router(
    model_list=[...],
    routing_strategy="least-busy",
    num_retries=3,
)
```

```python
from tenacity import retry, wait_exponential

@retry(wait=wait_exponential(min=1, max=60))
def call_llm(prompt):
    ...
```

Most reliable: use LiteLLM for automatic rate-limit handling!