Autoscaling

Kimchi supports intelligent autoscaling for self-hosted model deployments. The autoscaling system automatically manages capacity through three capabilities that work together:

Replica-based scaling automatically adjusts model replicas based on traffic and GPU utilization metrics.

Automatic and user-defined hibernation scales model deployments to zero replicas during periods of inactivity to minimize GPU costs.

SaaS fallback integration routes requests to external model providers when self-hosted models are unavailable or scaling, ensuring continuous service availability.

Accessing scaling configuration

For new deployments, navigate to Kimchi → Model Deployments and click Deploy model. The scaling options are in the Automation and scaling section of the deployment drawer.

For existing deployments, select your deployment from the list and click the Settings tab. Changes take effect immediately, so you can adjust scaling behavior at any time.

All scaling configuration is also available programmatically through the Kimchi API. See the Model deployments API reference for details.

Autoscaling configuration

📘

Autoscaling is currently available only for vLLM-based model deployments. Ollama model deployments support hibernation and fallback features but not replica-based autoscaling.

Enable autoscaling to automatically adjust the number of running replicas based on demand. Configure minimum and maximum replica counts to define scaling boundaries.

Set min replicas to guarantee a minimum level of availability for production workloads. The max replicas setting prevents runaway scaling and caps resource consumption.
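As a minimal sketch of how the two bounds constrain a scaling decision, the snippet below clamps a desired replica count to a configured range. The `ReplicaBounds` class and its field names are hypothetical and chosen for illustration only; they are not part of the Kimchi API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaBounds:
    """Hypothetical container for the min/max replica settings."""
    min_replicas: int
    max_replicas: int

    def __post_init__(self) -> None:
        # Reject configurations that cannot describe a valid range.
        if self.min_replicas < 0:
            raise ValueError("min_replicas must be non-negative")
        if self.max_replicas < self.min_replicas:
            raise ValueError("max_replicas must be >= min_replicas")

    def clamp(self, desired: int) -> int:
        """Restrict a desired replica count to the configured bounds."""
        return max(self.min_replicas, min(desired, self.max_replicas))


bounds = ReplicaBounds(min_replicas=2, max_replicas=8)
print(bounds.clamp(1))   # 2: raised to the floor
print(bounds.clamp(5))   # 5: already within bounds
print(bounds.clamp(20))  # 8: capped at the ceiling
```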

Scaling metrics

Choose the metric that triggers scaling decisions:

Number of requests waiting scales based on request queue depth. The system scales up when requests accumulate and scales down when the queue clears. This works well for variable request patterns and maintains consistent response times.

GPU cache usage scales based on GPU cache memory utilization. This metric is useful for large models.
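To illustrate how a single metric can drive a replica count, here is a sketch of a proportional rule of the kind used by the Kubernetes Horizontal Pod Autoscaler: replicas are scaled by the ratio of the observed metric to a target value. This is an assumption made for illustration. The document does not state which formula Kimchi applies, and the function name and target values below are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, metric_value: float, target_value: float) -> int:
    """Proportional rule: scale replicas by the ratio of observed to target metric.

    metric_value and target_value must use the same unit, for example
    waiting requests per replica or GPU cache usage as a fraction.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    return max(1, math.ceil(current_replicas * metric_value / target_value))


# Queue depth: 2 replicas, 12 waiting requests per replica, target of 4.
print(desired_replicas(2, 12, 4))      # 6

# GPU cache usage: 4 replicas at 50% usage against an 80% target.
print(desired_replicas(4, 0.5, 0.8))   # 3
```

In practice the result would still be limited by the configured min and max replicas.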

Automatic model hibernation

Model hibernation automatically scales deployments to zero replicas during inactivity, which reduces GPU costs while the model is idle. When enabled, the system monitors request throughput and triggers hibernation when activity falls below configured thresholds.

Configure hibernation thresholds based on your usage patterns. Shorter idle times maximize savings for development environments, while longer thresholds work better for production workloads with less predictable traffic.
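The threshold behavior described above can be sketched as a small idle monitor: hibernation triggers only after throughput has stayed below a threshold for a configured idle time. The class names, fields, and values are hypothetical and do not reflect Kimchi's actual setting names or internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HibernationPolicy:
    """Hypothetical policy: both fields are illustrative names."""
    idle_seconds: float             # how long activity must stay low
    min_requests_per_minute: float  # throughput below this counts as idle


class IdleMonitor:
    def __init__(self, policy: HibernationPolicy) -> None:
        self.policy = policy
        self._low_since: Optional[float] = None  # start of the current idle period

    def observe(self, now: float, requests_per_minute: float) -> bool:
        """Record a throughput sample; return True when hibernation should trigger."""
        if requests_per_minute >= self.policy.min_requests_per_minute:
            self._low_since = None  # activity resumed, reset the idle timer
            return False
        if self._low_since is None:
            self._low_since = now
        return now - self._low_since >= self.policy.idle_seconds


monitor = IdleMonitor(HibernationPolicy(idle_seconds=600, min_requests_per_minute=1.0))
print(monitor.observe(0, 5.0))    # False: traffic is above the threshold
print(monitor.observe(60, 0.0))   # False: idle period has just started
print(monitor.observe(660, 0.0))  # True: idle for 600 seconds
```

A shorter `idle_seconds` makes the monitor trigger sooner, which matches the development-environment case; a longer value tolerates gaps in production traffic.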

When a hibernated model receives requests, the system automatically begins wake-up while routing traffic to your configured fallback model. This ensures hibernation doesn't impact service availability.

Wake-up times vary by model size and GPU availability:

  • Smaller models: 2–3 minutes
  • Larger models: 5–10 minutes

Traffic gradually transitions back to self-hosted deployments once the model is fully operational.
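The wake-up flow can be sketched as a routing rule: all traffic goes to the fallback while the model is hibernated or waking, and the self-hosted share then ramps up once the model is ready. The linear ramp, the `ramp_seconds` parameter, and the state names are assumptions for illustration; the document only says that the transition is gradual.

```python
from enum import Enum


class DeploymentState(Enum):
    HIBERNATED = "hibernated"
    WAKING = "waking"
    READY = "ready"


def self_hosted_share(state: DeploymentState, seconds_since_ready: float,
                      ramp_seconds: float = 120.0) -> float:
    """Fraction of traffic sent to the self-hosted model (assumed linear ramp)."""
    if state is not DeploymentState.READY:
        return 0.0  # hibernated or waking: everything goes to the fallback
    if ramp_seconds <= 0:
        return 1.0
    return min(1.0, max(0.0, seconds_since_ready) / ramp_seconds)


def pick_backend(share: float, draw: float) -> str:
    """Choose a backend given the self-hosted share and a draw in [0, 1)."""
    return "self-hosted" if draw < share else "fallback"


print(self_hosted_share(DeploymentState.WAKING, 0))    # 0.0
print(self_hosted_share(DeploymentState.READY, 60))    # 0.5
print(self_hosted_share(DeploymentState.READY, 500))   # 1.0
print(pick_backend(0.5, draw=0.2))                     # self-hosted
```

A caller would typically supply `draw` from `random.random()`; passing it in explicitly keeps the routing decision easy to test.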

Fallback model configuration

SaaS fallback models ensure service continuity when self-hosted models are hibernating, scaling, or otherwise unavailable. Configure external providers to handle requests during these periods.

Fallbacks are restricted to external SaaS providers to prevent routing loops. Select providers whose models are similar in capability to your self-hosted deployment.

The fallback system activates automatically during hibernation, initial deployment, model errors, and maintenance periods.
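The activation conditions listed above can be expressed as a simple status check. This is a sketch under assumptions: the status names and the `resolve_target` helper are hypothetical, and the model identifier is a placeholder rather than a real provider model.

```python
from enum import Enum, auto
from typing import Optional


class ModelStatus(Enum):
    RUNNING = auto()
    HIBERNATING = auto()
    DEPLOYING = auto()    # initial deployment
    ERROR = auto()
    MAINTENANCE = auto()


# Statuses in which requests are routed to the fallback model.
FALLBACK_STATUSES = frozenset({
    ModelStatus.HIBERNATING,
    ModelStatus.DEPLOYING,
    ModelStatus.ERROR,
    ModelStatus.MAINTENANCE,
})


def resolve_target(status: ModelStatus, fallback_model: Optional[str]) -> str:
    """Return where a request should go for the given deployment status."""
    if status not in FALLBACK_STATUSES:
        return "self-hosted"
    if fallback_model is None:
        # No provider registered, so there is nothing to fall back to.
        raise RuntimeError("no fallback model configured")
    return fallback_model


print(resolve_target(ModelStatus.RUNNING, "example-saas-model"))      # self-hosted
print(resolve_target(ModelStatus.HIBERNATING, "example-saas-model"))  # example-saas-model
```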

📘

Fallback configuration is unavailable until you register a provider with Kimchi.

Monitoring and observability

Each deployed model provides a detailed event history showing all autoscaling and hibernation activities. Navigate to a specific model and click the Events tab to view a chronological record.

Autoscaling events record when replicas are increased or decreased, including the trigger that caused the scaling action.

Hibernation events track when models scale to zero replicas due to inactivity and when they resume from the hibernated state.

Each event record includes a precise timestamp, event type, description of what occurred, and which system component initiated the action.
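For readers who process event history outside the UI, the record described above could be modeled as follows. The field names, event type strings, and sample values are illustrative assumptions, not the actual Kimchi event schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class DeploymentEvent:
    """Hypothetical shape of one event record."""
    timestamp: datetime
    event_type: str    # for example "autoscaling" or "hibernation"
    description: str
    initiated_by: str  # system component that triggered the action


def events_of_type(events: Iterable[DeploymentEvent], event_type: str) -> List[DeploymentEvent]:
    """Return events of one type in chronological order."""
    matching = (e for e in events if e.event_type == event_type)
    return sorted(matching, key=lambda e: e.timestamp)


# Made-up sample records, for illustration only.
history = [
    DeploymentEvent(datetime(2025, 1, 1, 12, 30, tzinfo=timezone.utc),
                    "hibernation", "Scaled to zero replicas after inactivity", "hibernation-controller"),
    DeploymentEvent(datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc),
                    "autoscaling", "Replicas increased from 2 to 4", "autoscaler"),
    DeploymentEvent(datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc),
                    "autoscaling", "Replicas decreased from 3 to 2", "autoscaler"),
]

for event in events_of_type(history, "autoscaling"):
    print(event.timestamp.isoformat(), event.description)
```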