Overview

Kimchi Inference is a serverless API that gives you instant access to production-ready open-source LLMs — Kimi K2.6, MiniMax M3, and Nemotron. It exposes an OpenAI-compatible endpoint, so any client that works with the OpenAI API works with Kimchi.

No GPUs to provision, no clusters to manage. You send requests to https://llm.kimchi.dev/openai/v1 and pay per token.

How it works

%%{init: {'theme': 'dark'}}%%
flowchart LR
    A["Your client"] --> B["Kimchi Inference API\nhttps://llm.kimchi.dev/openai/v1"]
    B --> C["Open-source model"]
    style B fill:#2d2d2d,stroke:#888,color:#e0e0e0

Kimchi Inference accepts any request that follows the OpenAI chat completions format. Under the hood, Kimchi routes each request to the model you specify, running on Kimchi's own GPU infrastructure.

There are two deployment modes:

Serverless (Kimchi Inference) — Kimchi hosts the models. You pay per token with no infrastructure to manage.
Self-hosted — Run the same models on your own Kubernetes cluster when per-token costs exceed compute costs or compliance requires it. See Hosted Model Deployment.

Available models

Model ID	Best for	Context
kimi-k2.7	Agentic coding, image analysis (latest)	1M
minimax-m3	Code execution, debugging	1M
glm-5.2-fp8	Complex reasoning	1M
deepseek-v4-flash	Fast inference, cost-efficient tasks	1M
nemotron-3-ultra-fp4	Fast inference, cost-efficient tasks	1M

A common pattern is to pair Kimi K2.6 for planning with MiniMax M3 for code execution.

Pricing

Pay-per-token with separate rates for input and output tokens. See kimchi.dev/pricing for current plans and rates, or the Pricing reference for per-model token costs.

Kimchi Inference vs. Kimchi Coding

Kimchi offers two products:

Kimchi Inference (this section) — a serverless API for sending LLM requests from any OpenAI-compatible client. You control the model, prompt, and integration.
Kimchi Coding — an autonomous AI coding agent that uses Kimchi Inference under the hood, adding multi-model orchestration, phase tracking, and cost attribution on top. You describe a task and the agent handles it.

If you want to use Kimchi models through your existing IDE or tool, start with the Quickstart. If you want the full autonomous coding experience, see Kimchi Coding.

Next steps

Quickstart

Create an API key and make your first request in under a minute.

Setup recipes

Step-by-step guides for Claude Code, Cursor, OpenCode, Cline, Windsurf, and Continue.

Rate limits

Understand request limits, token quotas, and how to request higher throughput.