
Rate Limits

Rate limits protect the Morpheus Inference API from abuse and ensure fair access for all users. Limits vary by model category — larger, more compute-intensive models have lower limits, while smaller models allow higher throughput.
Rate limits are applied per user, per model category. Using a Small model and a Large model simultaneously counts against separate limits.

Text Models

Text models are grouped into three tiers based on model size: Small (S), Medium (M), and Large (L). Each model on the Available Models page is tagged with its tier.
Tier   Requests/min   Tokens/min
S      500            1,000,000
M      50             750,000
L      20             500,000
Small (S) — Examples: llama-3.2-3b, qwen3-4b
Medium (M) — Examples: llama-3.3-70b, qwen3-next-80b
Large (L) — Examples: glm-5, kimi-k2.5
Visit Available Models for the most up-to-date model classifications.
Models used with the :web suffix (e.g. llama-3.3-70b:web) share the same rate limit tier as the base model.

Other Models

Type        Requests/min
Embedding   500
Audio       60
Embedding models have no tokens-per-minute limit — only the requests-per-minute limit applies.

Response Headers

Every API response includes OpenAI-compatible rate limit headers so you can monitor your usage programmatically:
Header                           Description
X-RateLimit-Limit-Requests       Maximum requests allowed in the current window
X-RateLimit-Limit-Tokens         Maximum tokens allowed per minute
X-RateLimit-Remaining-Requests   Requests remaining in the current window
X-RateLimit-Remaining-Tokens     Tokens remaining in the current minute
X-RateLimit-Reset-Requests       ISO 8601 timestamp when the request window resets
X-RateLimit-Reset-Tokens         ISO 8601 timestamp when the token window resets
Retry-After                      Seconds until you can retry (only present on 429 responses)
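For example, given a requests response object, the counters can be read directly. This is a minimal sketch; note that header values arrive as strings:

# Header values are strings; cast the counters to int before comparing.
remaining_requests = int(response.headers["X-RateLimit-Remaining-Requests"])
remaining_tokens = int(response.headers["X-RateLimit-Remaining-Tokens"])
resets_at = response.headers["X-RateLimit-Reset-Requests"]  # ISO 8601 timestamp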

Handling Rate Limit Errors

When you exceed a rate limit, the API returns a 429 Too Many Requests response with an OpenAI-compatible error body:
{
  "error": {
    "message": "Rate limit exceeded: 20/20 requests per minute. Please retry after 45 seconds.",
    "type": "rate_limit_exceeded",
    "param": null,
    "code": "rate_limit_exceeded"
  }
}
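The message field tells you which limit you hit and how long to wait. A minimal sketch for surfacing it, assuming the requests library and a response object holding the HTTP response:

# Sketch: log the server-supplied detail from a 429 response.
if response.status_code == 429:
    detail = response.json()["error"]["message"]
    retry_after = response.headers.get("Retry-After")
    print(f"Rate limited: {detail} (Retry-After: {retry_after}s)")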

Best Practices

When you receive a 429 response, wait and retry with exponential backoff. Check the Retry-After header for the exact number of seconds to wait before the next attempt.
import time
import requests

def make_request_with_retry(url, headers, data, max_retries=3):
    """POST with automatic retry on 429 responses, honoring Retry-After."""
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=data)

        if response.status_code == 429:
            # Prefer the server-supplied wait time; fall back to exponential backoff.
            retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(retry_after)
            continue

        return response

    raise Exception("Max retries exceeded")
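For example, the helper can wrap a chat completions call. The base URL, API key, and request body below are placeholders for illustration, not documented Morpheus values:

# Example usage (placeholder URL and key; substitute your actual endpoint and API key).
response = make_request_with_retry(
    url="https://api.example.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    data={
        "model": "llama-3.3-70b",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(response.status_code)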
Track X-RateLimit-Remaining-Requests and X-RateLimit-Remaining-Tokens in every response. Slow down proactively when remaining values approach zero instead of waiting for a 429.
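One way to do this, sketched below for a requests response object, is to pause whenever the remaining counters drop below a safety margin. The threshold values here are hypothetical; tune them to your workload:

import time
from datetime import datetime, timezone

def throttle_if_needed(response, min_requests=5, min_tokens=10_000):
    # Hypothetical thresholds; adjust to your own traffic pattern.
    remaining_requests = int(response.headers.get("X-RateLimit-Remaining-Requests", min_requests))
    remaining_tokens = int(response.headers.get("X-RateLimit-Remaining-Tokens", min_tokens))

    if remaining_requests < min_requests or remaining_tokens < min_tokens:
        reset = response.headers.get("X-RateLimit-Reset-Requests")
        if reset:
            # The reset header is an ISO 8601 timestamp; sleep until it passes.
            reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
            time.sleep(max((reset_at - datetime.now(timezone.utc)).total_seconds(), 0))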
If your workload requires high throughput, consider using a Small (S) tier model such as qwen3-4b or llama-3.2-3b, which allow up to 500 requests/min. Reserve Large (L) tier models for tasks that genuinely require their additional capability.
For embedding workloads, send multiple texts in a single request using the array input format instead of making individual calls per text. This helps you stay within RPM limits.
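A minimal sketch of a batched embedding request, assuming an OpenAI-compatible /v1/embeddings endpoint and response schema; the URL and model name are placeholders:

import requests

texts = ["first document", "second document", "third document"]

# One request embeds every text, consuming a single unit of the 500 requests/min allowance.
response = requests.post(
    "https://api.example.com/v1/embeddings",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "your-embedding-model", "input": texts},
)
embeddings = [item["embedding"] for item in response.json()["data"]]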

How Rate Limiting Works

The Morpheus API uses a fixed-window rate limiter, aligned to 60-second boundaries (i.e., each calendar minute). Both request count (RPM) and token usage (TPM) are tracked independently:
  1. Before processing: The API checks your RPM count and estimated token usage against the limits for the requested model’s tier.
  2. After processing: Actual token usage (input + output tokens combined) is recorded against your TPM allowance.
  3. On limit exceeded: The request is rejected with a 429 response before any compute is consumed (for an RPM violation) or before any token usage is recorded (for a TPM violation).
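The server-side implementation is not exposed, but conceptually the bookkeeping resembles the following sketch of a fixed-window counter. This is illustrative only, not the actual Morpheus code:

import time

class FixedWindowLimiter:
    """Illustrative fixed-window RPM/TPM counter aligned to calendar minutes."""

    def __init__(self, rpm_limit, tpm_limit):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window_start = None   # start of the current 60-second window
        self.request_count = 0
        self.token_count = 0

    def check(self, estimated_tokens):
        """Step 1: called before processing; returns False to signal a 429."""
        minute = int(time.time()) // 60 * 60   # align to the calendar minute
        if minute != self.window_start:
            # A new window begins at each minute boundary; counters reset.
            self.window_start, self.request_count, self.token_count = minute, 0, 0
        if (self.request_count + 1 > self.rpm_limit
                or self.token_count + estimated_tokens > self.tpm_limit):
            return False
        self.request_count += 1
        return True

    def record(self, actual_tokens):
        """Step 2: called after processing; actual input + output tokens count toward TPM."""
        self.token_count += actual_tokens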
If the rate limiting service encounters an internal error (e.g., Redis unavailable), the system fails open — your request will be processed normally rather than rejected.

Next Steps