Deploy a custom model with Kimchi

This tutorial walks you through taking a model stored in Google Cloud Storage and deploying it with Kimchi, so you can call it through an OpenAI-compatible API.

Introduction

This tutorial uses three terms:

  • Base model — a general-purpose model like LLaMA 3, Mistral, etc.
  • Fine-tuned model — that base model, further trained on your data (for your domain, tone of voice, tasks, etc.).
  • Custom model in Kimchi — any model whose weights live in your storage and are deployed through Kimchi into your Kubernetes cluster.

Prerequisites

  1. Kimchi set up and connected to a cluster
  2. A fine-tuned model saved in a GCS bucket

Typical layout:

gs://my-ml-bucket/models/llama3-support-bot-v1/
├── model.safetensors
├── tokenizer.json
├── config.json
└── metadata.json

If you're unsure about formats, start with Hugging Face-style folders and Safetensors.
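
Before uploading, it can help to confirm that a local folder matches the layout shown above. The following is a minimal sketch, not a Kimchi tool: the function name `check_model_dir` is hypothetical, and the list of required files simply mirrors the typical layout (it is an assumption, not a documented requirement).

```shell
#!/bin/sh
# check_model_dir DIR
# Returns 0 if DIR looks like a Hugging Face-style model folder:
# config.json, tokenizer.json, and at least one *.safetensors file.
# Assumption: the required-file list mirrors the typical layout in this
# tutorial; it is not an official Kimchi requirement.
check_model_dir() {
  dir="$1"
  missing=0

  for f in config.json tokenizer.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done

  # Accept a single weights file or sharded weights.
  found=0
  for w in "$dir"/*.safetensors; do
    if [ -f "$w" ]; then
      found=1
      break
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "missing: *.safetensors" >&2
    missing=1
  fi

  return "$missing"
}
```

Call it as `check_model_dir ./llama3-support-bot-v1` and continue only if it exits with status 0.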

Step 1: Prepare your model artifacts in GCS

Upload the model files to a dedicated folder in your bucket and note its path, for example:

gs://my-ml-bucket/models/llama3-custom-model/
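
If the model files are still on your machine, the upload to a path like the one above can be sketched with `gsutil rsync`. This assumes the Google Cloud CLI is installed and authenticated; the function name `upload_model` and the `DRY_RUN` switch are hypothetical helpers for illustration.

```shell
#!/bin/sh
# upload_model LOCAL_DIR GCS_PREFIX
# Syncs the contents of LOCAL_DIR to GCS_PREFIX with gsutil.
# With DRY_RUN=1 the command is printed instead of executed.
upload_model() {
  src="${1%/}"    # drop a trailing slash, if any
  dest="${2%/}"

  # Reject anything that is not gs://<bucket>/<path>.
  case "$dest" in
    gs://?*/?*) ;;
    *)
      echo "destination must look like gs://bucket/path" >&2
      return 1
      ;;
  esac

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gsutil -m rsync -r $src $dest"
  else
    # -m parallelizes the transfer; -r recurses into subfolders.
    gsutil -m rsync -r "$src" "$dest"
  fi
}
```

For example, `DRY_RUN=1 upload_model ./llama3-custom-model gs://my-ml-bucket/models/llama3-custom-model/` prints the command without running it. After a real upload, `gsutil ls gs://my-ml-bucket/models/llama3-custom-model/` lists what landed in the bucket.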

Step 2: Connect your repository in Kimchi

  1. Log in to the Cast AI console
  2. Navigate to Kimchi → Hosted Models → Repositories
  3. Click Connect
  4. Paste your bucket path
  5. Run the provided script

Step 3: Register your custom model

  • Select a model from the detected list
  • Pick the base model
  • Add a description
  • If needed, enable the LoRA adapter option

Step 5: Call your custom model

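# Start a temporary pod with curl and open a shell in it.
kubectl run test-client --rm -i --tty --image=alpine/curl -- /bin/sh

# Inside the pod: CASTAI_API_KEY is not inherited from your local shell, so set it first.
# The Authorization header uses double quotes so that the shell expands the variable.
export CASTAI_API_KEY='<your-api-key>'
curl http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -v -X POST -d '{
    "model": "Llama-3.2-3B-Instruct-abliterated",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ]
  }'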
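
For repeated calls, the request above can be wrapped in a small helper. This is a sketch under stated assumptions: `build_payload` and `ask` are hypothetical names, the endpoint is copied from the curl example, and the payload builder does not JSON-escape its inputs, so the model name and prompt must not contain double quotes or backslashes.

```shell
#!/bin/sh
# Endpoint copied from the curl example in this step.
KIMCHI_URL="http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions"

# build_payload MODEL PROMPT
# Prints a chat-completions request body with a single user message.
# Assumption: MODEL and PROMPT contain no double quotes or backslashes.
build_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' "$1" "$2"
}

# ask MODEL PROMPT
# Sends the request and prints the raw JSON response.
# Fails early if the API key is not set.
ask() {
  if [ -z "${CASTAI_API_KEY:-}" ]; then
    echo "CASTAI_API_KEY is not set" >&2
    return 1
  fi
  curl -sS "$KIMCHI_URL" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -H "Authorization: Bearer $CASTAI_API_KEY" \
    -d "$(build_payload "$1" "$2")"
}
```

Inside the test pod, `ask "Llama-3.2-3B-Instruct-abliterated" "Which GCP instance types suit model training?"` sends the same kind of request as the curl command above. With an OpenAI-compatible response, the reply text is expected under `choices[0].message.content`.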