Model Deployments

The Model Deployments feature lets you deploy and manage AI models directly in a Kubernetes cluster connected to Cast AI.

📘
Currently, Cast AI supports deploying a subset of Ollama models. The available models depend on your cluster's region and GPU availability.

Prerequisites

Before deploying models, ensure your cluster meets these requirements:

Cluster connectivity: The cluster must be connected to Cast AI and running in automated optimization mode (Phase 2).
GPU driver installation: GPU drivers must be installed on your cluster with the correct tolerations. Follow the GPU driver installation guide to check whether your cloud service provider (CSP) supplies the drivers or whether you need to install them yourself.

📘
Bottlerocket users: Bottlerocket AMIs come with pre-installed NVIDIA drivers. Do not install additional NVIDIA device plugins, as they may conflict with the pre-installed drivers.

GPU daemonset tolerations: Your GPU daemonset must include this toleration:

tolerations:
- key: "scheduling.cast.ai/node-template"
  operator: "Exists"

Apply it with the following command, which sets the daemonset's tolerations list to this single entry:

kubectl patch daemonset <daemonset-name> -n <daemonset-namespace> --type=json -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "scheduling.cast.ai/node-template", "operator": "Exists"}]}]'
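A JSON Patch add operation that targets an existing field replaces its value, so the command above would drop any tolerations the daemonset already defines. The sketch below shows one way to keep them: it emits the append form (path ending in /-) when tolerations already exist and the create form otherwise. The helper name toleration_patch is hypothetical, and the usage comment assumes jq is installed.

```shell
# Sketch: choose a JSON Patch that keeps tolerations already on the daemonset.
# toleration_patch is a hypothetical helper, not part of kubectl or Cast AI.

TOLERATION='{"key": "scheduling.cast.ai/node-template", "operator": "Exists"}'

# Print the patch for a daemonset that currently has $1 tolerations.
toleration_patch() {
  if [ "${1:-0}" -gt 0 ]; then
    # "/-" appends to the existing array instead of replacing it.
    printf '[{"op": "add", "path": "/spec/template/spec/tolerations/-", "value": %s}]\n' "$TOLERATION"
  else
    # No array yet: create it with the single required entry.
    printf '[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [%s]}]\n' "$TOLERATION"
  fi
}

# Usage against a live cluster (same placeholders as the command above):
#   count=$(kubectl get daemonset <daemonset-name> -n <daemonset-namespace> -o json \
#     | jq '(.spec.template.spec.tolerations // []) | length')
#   kubectl patch daemonset <daemonset-name> -n <daemonset-namespace> \
#     --type=json -p="$(toleration_patch "$count")"
```

The append form does not check for duplicates, so inspect the existing tolerations first if the daemonset may already carry this entry.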

ARM node GPU compatibility: GPUs on ARM-based nodes (e.g., AWS Graviton) must have NVIDIA Compute Capability 8.0 or higher. Compatible examples: NVIDIA A100, A30 (Ampere), H100, H200 (Hopper), B200, GB200 (Blackwell).

⚠️
GKE users: Cast AI requires container-optimized images for GPU nodes. Ubuntu-based node images are not supported for GPU workloads and will cause deployment failures. See GKE GPU troubleshooting below.

Set up a cluster for model deployments

Navigate to Kimchi > Model Deployments in the Cast AI console.
Click Install Kimchi.
Select your cluster from the list.

📘
Only eligible clusters will appear in this list.

Run the provided script in your terminal or cloud shell.
Wait for the installation to complete.

Deploy a model

Once Kimchi is installed, you can deploy models to your cluster:

📘
When you deploy a model, Cast AI automatically enables the Unscheduled pods policy if it is currently disabled. The policy only affects model deployments — Cast AI activates just the node template for hosted models while keeping all other templates disabled.

Select your cluster.
Choose a supported model from the list.

📘
GPU availability: Some models require specific GPU types. For example, 70B models need A100 GPUs. If the required GPU type is not available in your cluster's region, the model does not appear in the list.

Configure the deployment:
- Specify a service name and port for accessing the deployed model within the cluster.
- Select an existing node template or let Cast AI create a custom template with the recommended configuration.
Click Deploy to start the deployment.

The model deploys into the castai-llms namespace. Monitor progress on the Model Deployments page, where the status changes from Deploying to Running once the deployment completes. To check the pods directly, run:

kubectl get pod -n castai-llms

📘
Model deployment may take up to 25 minutes.
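For scripted rollouts, polling by hand can be replaced with kubectl wait. The following is a minimal sketch, assuming that every pod in the castai-llms namespace belongs to a model deployment and that the 25-minute upper bound is a reasonable timeout. The helper name wait_for_model_pods is hypothetical.

```shell
# Sketch: block until the model pods are Ready or the deadline passes.
# wait_for_model_pods is a hypothetical helper name.

NAMESPACE="castai-llms"
TIMEOUT_SECONDS=$((25 * 60))   # the 25-minute upper bound, in seconds

wait_for_model_pods() {
  # kubectl wait exits non-zero on timeout, and also when no pods exist yet,
  # so call this only after the deployment has created its pods.
  kubectl wait --for=condition=Ready pod --all \
    -n "$NAMESPACE" --timeout="${TIMEOUT_SECONDS}s"
}

# Usage:
#   wait_for_model_pods && echo "model pods are ready"
```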

Supported models

Available models depend on your cluster's region and GPU types. To get the current list with pricing and GPU requirements, call this API endpoint:

GET /ai-optimizer/v1beta/organizations/{organizationId}/clusters/{clusterId}/hosted-models
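As a sketch of calling this endpoint from a terminal, the helper below builds the full URL from an organization ID and a cluster ID. It assumes the API is served from https://api.cast.ai and accepts the same X-API-Key header used in the proxy examples on this page; neither is stated here, so verify both against the API reference. The names hosted_models_url, ORG_ID, and CLUSTER_ID are hypothetical.

```shell
# Sketch: build the hosted-models URL for one cluster.
# Assumption: the Cast AI API base URL is https://api.cast.ai.
CASTAI_API_URL="https://api.cast.ai"

# $1 = organization ID, $2 = cluster ID
hosted_models_url() {
  printf '%s/ai-optimizer/v1beta/organizations/%s/clusters/%s/hosted-models\n' \
    "$CASTAI_API_URL" "$1" "$2"
}

# Usage (assumes X-API-Key authentication, as in the proxy examples):
#   curl -sS "$(hosted_models_url "$ORG_ID" "$CLUSTER_ID")" \
#     -H 'Accept: application/json' \
#     -H "X-API-Key: $CASTAI_API_KEY"
```

The response format is not described on this page, so the sketch leaves the output unparsed.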

Use deployed models

Deployed models are accessible through the Kimchi Proxy running as castai-ai-optimizer-proxy in the castai-agent namespace.

From within the cluster

Send requests to this base URL:

http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1

Example request:

curl http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'X-API-Key: {CASTAI_API_KEY}' \
  -X POST -d '{
    "model": "deepseek-r1:14b",
    "messages": [{"role": "user", "content": "Explain Kubernetes DaemonSets in one sentence."}]
  }'

From your local machine

Port-forward the proxy service, then make requests to the local port from a second terminal while the port-forward is running:

kubectl port-forward svc/castai-ai-optimizer-proxy 8080:443 -n castai-agent

curl http://localhost:8080/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'X-API-Key: {CASTAI_API_KEY}' \
  -X POST -d '{
    "model": "deepseek-r1:14b",
    "messages": [{"role": "user", "content": "Explain Kubernetes DaemonSets in one sentence."}]
  }'
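The requests above hard-code the prompt inside a single-quoted JSON body, which is awkward when the prompt comes from a variable. The sketch below builds the same body from two arguments and escapes backslashes and double quotes. The helper names json_escape and chat_payload are hypothetical, and the escaping is deliberately minimal.

```shell
# Sketch: build the chat request body from a model name and a prompt.
# json_escape and chat_payload are hypothetical helper names.

# Escape backslashes and double quotes so text can sit inside a JSON string.
# Limitation: control characters such as newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# $1 = model name, $2 = user prompt
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage with the port-forward from above:
#   curl http://localhost:8080/openai/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -H "X-API-Key: $CASTAI_API_KEY" \
#     -d "$(chat_payload 'deepseek-r1:14b' "$PROMPT")"
```

For prompts that may contain newlines or other control characters, a JSON-aware tool such as jq is a safer way to build the body.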

Troubleshooting

GKE GPU provisioning issues

Ubuntu image compatibility error

Issue: GPU nodes fail to provision on GKE clusters using Ubuntu node pools.

Cause: Cast AI requires container-optimized images for GPU nodes.

Solution:

Create a node configuration with a container-optimized image in the Image field.
Create or update the llms-by-castai node template and link it to this configuration.
Use this template for model deployment.
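To confirm which image family each node actually runs, you can read the osImage string that every node reports. The sketch below classifies that string. It assumes GKE container-optimized nodes report a value containing "Container-Optimized" and Ubuntu nodes a value containing "Ubuntu"; the helper name classify_os_image is hypothetical.

```shell
# Sketch: classify a node's reported OS image string.
# classify_os_image is a hypothetical helper; the matched substrings are
# assumptions about what GKE nodes report in status.nodeInfo.osImage.
classify_os_image() {
  case "$1" in
    *Container-Optimized*) echo "cos" ;;
    *Ubuntu*)              echo "ubuntu" ;;
    *)                     echo "other" ;;
  esac
}

# Usage: print each node name with its image family.
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' |
#   while IFS="$(printf '\t')" read -r name image; do
#     printf '%s\t%s\n' "$name" "$(classify_os_image "$image")"
#   done
```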

Custom GPU image naming error

Error: finding GPU attached instance image for "cluster-api-ubuntu-2204-v1-27.*nvda"

Cause: Cast AI only supports GCP default GPU images. Custom images must follow GCP NVIDIA naming patterns.

Solution: Ensure custom image names follow the GCP NVIDIA naming convention, as in these examples:

gke-1259-gke2300-cos-101-17162-210-12-v230516-c-nvda
gke-12510-gke1200-cos-101-17162-210-18-c-cgpv1-nvda
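A rough pre-check for a custom image name can be sketched as a glob match. The pattern below is inferred only from the two example names above (gke- prefix, a cos segment, -nvda suffix) and is an assumption, not the matching rule Cast AI applies. The helper name looks_like_gke_nvidia_image is hypothetical.

```shell
# Sketch: test whether an image name resembles the example names above.
# The glob is an assumption inferred from those two examples only.
looks_like_gke_nvidia_image() {
  case "$1" in
    gke-*-cos-*-nvda) return 0 ;;
    *)                return 1 ;;
  esac
}

# Usage:
#   if looks_like_gke_nvidia_image "$IMAGE_NAME"; then
#     echo "name matches the example pattern"
#   else
#     echo "name does not match; check the naming convention"
#   fi
```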