Rate Limits

Rate limits control how many requests you can make to Kimchi Model APIs within a given time period. These limits help maintain service stability and ensure fair access for all users.

Kimchi enforces two types of rate limits on serverless inference endpoints:

Requests per minute (RPM) limits the number of API calls you can make each minute.

Tokens per minute (TPM) limits the total number of input and output tokens processed each minute.
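To make the two counters concrete, the following is a minimal client-side sketch that tallies requests and tokens over a trailing 60-second window. The `UsageWindow` class and its method names are hypothetical and not part of any Kimchi SDK. It only approximates what the server measures, and the server's own accounting may differ.

```python
import time
from collections import deque


class UsageWindow:
    """Tracks requests and tokens sent during the trailing window (default 60 s)."""

    def __init__(self, window_seconds: float = 60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the class can be tested without waiting
        self._events = deque()  # (timestamp, tokens) pairs, oldest first

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Record one request; TPM counts input and output tokens together."""
        self._events.append((self.clock(), input_tokens + output_tokens))

    def _prune(self) -> None:
        """Drop events that have fallen out of the window."""
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def rpm(self) -> int:
        """Number of requests recorded inside the current window."""
        self._prune()
        return len(self._events)

    def tpm(self) -> int:
        """Total tokens recorded inside the current window."""
        self._prune()
        return sum(tokens for _, tokens in self._events)
```

Calling `record()` after each response and logging `rpm()` and `tpm()` gives a rough picture of how close a workload runs to either limit.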

If you exceed either limit, the API returns HTTP status 429 Too Many Requests. Your application should retry with exponential backoff so that it handles rate limit responses gracefully.
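One way to structure such retry logic is sketched below. It is an illustration under stated assumptions, not an official client: `send` is a caller-supplied function, `RateLimited` is a hypothetical exception that `send` raises when it sees a 429, and the default delays are arbitrary.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical: raised by the caller's send function on HTTP 429."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call send() until it succeeds or max_attempts is reached.

    The delay cap doubles after every 429 (base_delay, 2x, 4x, ...) up to
    max_delay. The actual wait is a random fraction of that cap (jitter),
    so that many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * rng())
```

The `sleep` and `rng` parameters exist only so the function can be exercised without real waiting. In production the defaults apply.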

Limits by plan

Plan                  Free                                 Starter                    Enterprise
Rate limits           Dynamic (infrastructure protection)  Dynamic (higher capacity)  Custom
Primary gating        Credit balance                       PAYG after credits         Committed usage
When credits run out  Requests rejected                    Billing kicks in           Custom

Rate limits are dynamic: they adjust based on current system load and available capacity rather than being fixed numbers. This means your effective throughput may vary, but the system is designed to maximise what each tier can use at any given moment.

Note: Rate limits exist across all tiers for infrastructure protection. The primary gating mechanism for the Free tier is credit balance, not rate limits. Paid tiers enjoy higher effective capacity.

Rate limit responses

When you exceed your rate limit, the API returns HTTP 429 with an error in the response body:

{"error": "minimax-m2.7 model is rate limited until 2026-02-05T15:32:41Z"}

The response includes a Retry-After header indicating how many seconds to wait before retrying:

Retry-After: 5

Note: If you have multiple providers configured for a model, Kimchi automatically attempts fallback to other available providers before returning a rate limit error.
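As an illustration, the sketch below extracts both pieces of information from a 429 response. The helper name `parse_rate_limit` is hypothetical, and the regular expression assumes the error message always carries a UTC timestamp in the form shown in the example above. That wording is not a documented contract, so treat the timestamp as best-effort and the Retry-After header as the primary signal.

```python
import json
import re
from datetime import datetime, timezone


def parse_rate_limit(headers: dict, body: str):
    """Return (retry_after_seconds, limited_until) from a 429 response.

    Either value is None when it cannot be read from the response.
    Only the integer-seconds form of Retry-After is handled here.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {str(k).lower(): v for k, v in headers.items()}
    retry_after = None
    raw = lowered.get("retry-after")
    if raw is not None and str(raw).strip().isdigit():
        retry_after = int(str(raw).strip())

    limited_until = None
    try:
        message = json.loads(body).get("error", "")
    except (ValueError, TypeError, AttributeError):
        message = ""  # body was not a JSON object
    if not isinstance(message, str):
        message = ""
    # Assumed message format: "... rate limited until 2026-02-05T15:32:41Z"
    match = re.search(r"until (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z", message)
    if match:
        limited_until = datetime.strptime(
            match.group(1), "%Y-%m-%dT%H:%M:%S"
        ).replace(tzinfo=timezone.utc)
    return retry_after, limited_until
```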

Best practices

  • Respect the Retry-After header: when you receive a 429 response, wait the number of seconds specified in the header before retrying.
  • Implement exponential backoff: in addition to the Retry-After delay, increase wait times progressively for repeated failures.
  • Batch requests where possible: combine multiple small prompts into fewer, larger requests to reduce overhead.
  • Monitor your usage: track token consumption in the Kimchi console to anticipate when you might approach limits.
  • Use appropriate model sizes: smaller models have higher rate limits. Choose the smallest model that meets your quality requirements for each use case.
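The first two practices can be combined into a single wait policy. The sketch below is one possible reading, not a prescribed algorithm: the function name `next_wait` and its default delays are assumptions. It treats the Retry-After value as a floor and lets the backoff term take over once repeated failures push it higher.

```python
def next_wait(retry_after, consecutive_failures, base_delay=1.0, max_delay=60.0):
    """Seconds to wait before the next retry.

    retry_after: seconds from the Retry-After header, or None if absent.
    consecutive_failures: number of 429 responses in a row, starting at 1.
    """
    # Backoff doubles with each consecutive failure, capped at max_delay.
    backoff = min(max_delay, base_delay * 2 ** (consecutive_failures - 1))
    # Never wait less than the server asked for, even if that exceeds max_delay.
    return max(float(retry_after or 0), backoff)
```

With the defaults, a first failure with `Retry-After: 5` waits 5 seconds, while a fourth consecutive failure with the same header waits 8 seconds because the backoff term has grown past the header value.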

Upgrading your plan

When you need higher rate limits, you can upgrade your plan:

  1. Navigate to app.kimchi.dev/settings
  2. Select your desired plan
  3. Click Upgrade and complete the checkout

Rate limit increases take effect immediately after upgrading.