Two years ago I wrote about why reactive autoscaling falls short and what ML brings to the table. A lot has changed. LLMs are now a primary workload in most cloud fleets, and they break almost every assumption the classic autoscaling stack was built on. Here's what's actually different, and where Model Context Protocol fits into the picture.
The three-post series I wrote in 2024 — covering ML-based prediction, Turbonomic's economic model, and the KEDA/Karpenter open-source stack — was accurate for its moment. But LLMs running at scale are a genuinely different class of problem. Not incrementally harder. Structurally different. And the field has responded with new primitives, new metrics, and a new role for LLMs themselves as reasoning agents inside the scaling loop.
Why LLMs Break the Old Playbook
The classic autoscaling signal is CPU utilization. It's a reasonable proxy for traditional web services — request comes in, CPU goes up, replica count follows. For LLM inference, CPU is nearly useless as a signal. The real bottleneck is the GPU, and GPU utilization itself is misleading: a GPU can be at 40% compute utilization but 95% memory utilization (holding KV cache) and be completely saturated. Latency is already degrading. Your HPA watching CPU hasn't noticed yet.
Beyond the wrong metric, LLM inference has three other properties that invalidate the standard playbook. First, startup time: a new LLM pod can't serve a single token until model weights are loaded into GPU memory — 30 to 120 seconds, depending on model size and whether a shared weights cache is in place. Reactive scaling is effectively useless without some form of pre-warming. Second, request heterogeneity: a 200-token prompt and a 20,000-token prompt hit the same endpoint but consume wildly different amounts of GPU memory and time. CPU-based HPA treats them identically. Third, KV cache dynamics: the key-value cache that makes attention efficient is also the primary memory pressure surface, and its behavior depends on prompt length distribution in ways that simple utilization metrics don't capture.
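The compute-vs-memory mismatch above is easy to see in code. Here's a minimal sketch of a GPU-aware saturation check — the field names and thresholds are my own illustrative assumptions, not any real exporter's schema:

```python
# Hypothetical sketch: GPU-aware saturation check for LLM serving.
# Field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class GpuSnapshot:
    compute_util: float   # 0.0-1.0, SM/compute occupancy
    kv_cache_util: float  # 0.0-1.0, fraction of KV-cache memory in use

def is_saturated(snap: GpuSnapshot, kv_threshold: float = 0.90) -> bool:
    # A GPU is effectively saturated for inference once KV-cache memory
    # is nearly full, no matter how idle the compute side looks.
    return snap.kv_cache_util >= kv_threshold or snap.compute_util >= 0.95

# The 40% compute / 95% memory case from the text: saturated.
print(is_saturated(GpuSnapshot(compute_util=0.40, kv_cache_util=0.95)))  # True
```

A CPU- or compute-only autoscaler sees the 40% figure and does nothing; the memory-aware check fires correctly.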
The New Metric Vocabulary
The field has largely converged on four signals that actually describe LLM workload health. They've replaced CPU% for anyone running serious inference infrastructure.
TTFT (Time to First Token) is the prefill latency: how long before the user sees the first character. It's the primary SLO for conversational workloads. NVIDIA's enterprise RAG autoscaling work, published in December 2025, targets TTFT p90 under 2 seconds and scales on it directly via Prometheus + HPA.

ITL (Inter-Token Latency) is the decode throughput: how fast tokens stream after the first one. It degrades under KV cache pressure.

KV cache utilization (exposed by vLLM as gpu_cache_usage_perc) is the best leading indicator of impending latency degradation; scale up before it hits 90%.

Queue depth (num_requests_waiting) is the most actionable real-time signal: when requests are piling up, add replicas now.
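How the four signals might combine into one coarse scaling decision can be sketched as follows — the priority order and every threshold here are assumptions for illustration, not a published policy:

```python
# Illustrative only: priorities and thresholds are assumptions, not a spec.
def scale_signal(ttft_p90_s: float, itl_p50_ms: float,
                 kv_cache_util: float, queue_depth: int) -> str:
    """Combine the four LLM health signals into a coarse scaling decision."""
    if queue_depth > 10:
        return "scale_up_now"      # requests already piling up: act immediately
    if kv_cache_util > 0.90:
        return "scale_up_soon"     # leading indicator: scale before latency degrades
    if ttft_p90_s > 2.0 or itl_p50_ms > 100:
        return "scale_up_now"      # SLO breach already visible to users
    return "hold"

print(scale_signal(ttft_p90_s=1.2, itl_p50_ms=40, kv_cache_util=0.93, queue_depth=3))
```

The ordering encodes the point of the section: queue depth is the most actionable signal, KV cache utilization is the leading indicator, and the latency metrics confirm a breach that's already happening.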
Disaggregated Serving Changes the Architecture
One of the bigger architectural shifts of 2025 was disaggregated prefill/decode serving, most prominently in llm-d (Red Hat / IBM, open-sourced 2025) and adopted by several cloud providers. The insight is simple: prefill (processing input tokens, building KV cache) and decode (generating output tokens) have completely different resource profiles. Prefill is compute bound; decode is memory-bandwidth bound. Running them on the same replica means you're always compromising.
Split them into separate pools and you can scale each independently — prefill on TTFT, decode on queue depth and ITL. The Inference Gateway (llm-d's IGW, now at v1.0) does KV-cache-aware routing: it knows which replica has a warm cache for a given prefix and routes there first, dramatically improving cache hit rates. The autoscaler for each pool watches different metrics. This is the architecture that llm-d's "variant autoscaler" implements — measuring capacity per instance, computing a load function that accounts for request shape, and adjusting the mix of prefill vs decode replicas based on live traffic.
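Per-pool scaling can be sketched with the standard HPA proportional formula, desired = ceil(current × observed / target), applied to a different metric per pool. The specific targets below are assumptions; llm-d's variant autoscaler is considerably more sophisticated than this:

```python
# Sketch of independent per-pool scaling for disaggregated serving.
# Metric targets are assumptions; this is just the HPA proportional rule.
import math

def desired_replicas(current: int, observed: float, target: float,
                     min_r: int = 1, max_r: int = 32) -> int:
    """desired = ceil(current * observed / target), clamped to [min_r, max_r]."""
    return max(min_r, min(max_r, math.ceil(current * observed / target)))

# Prefill pool scales on TTFT p90 (seconds); decode pool on queue depth.
prefill = desired_replicas(current=4, observed=3.1, target=2.0)
decode = desired_replicas(current=6, observed=42, target=8)
print(prefill, decode)  # 7 32
```

The point is that the two pools can disagree: here prefill needs a modest bump while decode is badly behind, which a single shared-replica autoscaler could never express.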
KAITO + KEDA: The Practical Stack
For teams running LLM inference on Kubernetes who don't want to build a custom scheduling stack, the practical answer in early 2026 is KAITO + KEDA. KAITO (Kubernetes AI Toolchain Operator) introduced its InferenceSet CRD in v0.8.0 (December 2025), and it now ships a native KEDA scaler that watches num_requests_waiting directly. The loop is clean: request queue builds → KEDA triggers HPA → KAITO InferenceSet scales vLLM replicas → Karpenter provisions GPU nodes.
The one thing that makes this whole setup viable for reactive scaling where it previously wasn't: a shared NFS model weights cache. Pod cold start was the killer — a new vLLM replica downloading Llama-70B from S3 takes 4+ minutes. With the weights mounted as a ReadOnlyMany PVC backed by an NFS volume on local high-speed storage, startup drops to ~30 seconds. That makes KEDA's reactive scaling actually usable. Red Hat's OpenShift AI team documented this in detail in November 2025.
The scrape-lag trap is worse for LLMs than for anything else. Prometheus scraping at 1-minute intervals, HPA syncing at 15 seconds — you can be 30–90 seconds behind a 20-second spike. For a web service, that hurts. For an LLM workload where new pods take 30 seconds to come up anyway, stacking that lag on top makes reactive scaling unacceptable. KEDA's direct-metric scalers (watching vLLM's live num_requests_waiting over the metrics API) cut this loop to under 5 seconds.
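Reading the live metric directly rather than waiting on a scrape cycle is straightforward, since vLLM exposes Prometheus text exposition. A minimal sketch of extracting num_requests_waiting (the sample payload and any threshold logic around it are assumptions):

```python
# Sketch: read num_requests_waiting straight from vLLM's Prometheus text
# exposition instead of waiting on a Prometheus scrape cycle.
def parse_waiting(metrics_text: str) -> float:
    """Extract the vllm:num_requests_waiting value from Prometheus text format."""
    for line in metrics_text.splitlines():
        # Skip # HELP / # TYPE lines; match the metric with or without labels.
        if line.startswith("vllm:num_requests_waiting"):
            return float(line.rsplit(" ", 1)[1])
    raise KeyError("vllm:num_requests_waiting not found")

sample = (
    '# HELP vllm:num_requests_waiting Number of requests waiting.\n'
    'vllm:num_requests_waiting{model_name="llama-70b"} 42.0\n'
)
print(parse_waiting(sample))  # 42.0
```

This is the shape of signal a direct-metric scaler acts on: no scrape interval, no adapter hop, just the current queue depth.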
Where MCP Changes Everything
Model Context Protocol launched in November 2024. By December 2025, Anthropic donated it to the Linux Foundation's Agentic AI Foundation, with 97 million monthly SDK downloads and 10,000+ active servers. It is, at this point, the standard interface for connecting LLM agents to real-world tools and data.
The interesting question for autoscaling isn't "can I use MCP with my Kubernetes cluster" — yes, obviously — but rather: what does it mean to have an LLM agent with full observability context and tool access sitting above your existing autoscaling stack?
The pattern emerging in 2025–2026: an LLM agent (Claude, GPT-4, Gemini) connected via MCP to Prometheus metrics, Kubernetes state, cloud cost APIs, and internal runbooks. It doesn't replace KEDA or HPA — those run in the fast inner loop. The agent operates in a slower outer loop: detecting situations that rule-based systems miss, reasoning about trade-offs (do I scale for performance or ride out the spike to protect cost?), and calling patch_hpa(), scale_deployment(), or trigger_karpenter() via MCP tool calls when it decides to act.
This is the thing thresholds can never do: explain their reasoning. An MCP agent can output "TTFT p90 is 3.8s, above our 2s SLO. Queue depth 42. KV cache at 91%. Current GPU cost $4.20/hr/replica. I'm adding 2 decode replicas — prefill is fine, this is decode-bottlenecked." That context is auditable. You can review it, learn from it, feed it back into prompt refinement. Rule engines just act.
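The auditable-reasoning idea can be sketched as a decision function that returns the action together with its explanation. The tool names, decision rule, and thresholds here are all hypothetical — real agents reason over far richer context than one if-statement:

```python
# Hypothetical sketch of an auditable agent decision record.
# The rule, thresholds, and action names are illustrative assumptions.
def decide(ttft_p90: float, queue: int, kv_util: float,
           cost_per_replica: float) -> dict:
    if ttft_p90 > 2.0 and kv_util > 0.90:
        return {
            "action": "add_decode_replicas",
            "count": 2,
            "reasoning": (
                f"TTFT p90 {ttft_p90}s exceeds the 2s SLO; queue depth {queue}; "
                f"KV cache at {kv_util:.0%}. Prefill is healthy, so this is "
                f"decode-bottlenecked. Cost impact: +${2 * cost_per_replica:.2f}/hr."
            ),
        }
    return {"action": "hold", "reasoning": "All signals within SLO."}

print(decide(3.8, 42, 0.91, 4.20)["action"])  # add_decode_replicas
```

The reasoning string is the artifact a threshold system can't produce: it survives into the audit log, and you can review it after the fact.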
The real value of MCP in infrastructure isn't automation — it's contextual reasoning over a full operational picture that no threshold system can replicate.
What Hasn't Changed (And What Still Should)
The economic model I described in the Turbonomic piece still stands. For heterogeneous hybrid environments — VMs, on-prem, multi-cloud, storage — a supply-demand market model that optimizes holistically across layers is still the right abstraction. IBM Turbonomic continues to add GPU-aware optimization (GPU workload optimization shipped in 2024–2025) and is the most mature commercial option for cross-stack application resource management (ARM) at enterprise scale.
What hasn't been solved cleanly: the integration between the LLM agent reasoning layer and the economic engine. Right now they're parallel tracks. The agent doesn't know what the market model is doing; the economic engine doesn't incorporate the agent's reasoning. A system where the economic model provides the cost and capacity context that the agent reasons over — and the agent's decisions feed back into the market model as demand signals — is the obvious synthesis. It's not shipping yet. It's the interesting frontier.
2024 vs 2026: What Changed
2024 State of the Art:
KEDA + Karpenter (Kubernetes native)
CPU/mem metrics (wrong signals for LLM)
Economic engine (Turbonomic) for hybrid
RL as outer loop — research only
Cold starts still a limiting constraint
No standard LLM-specific K8s operator
2026 State of the Art:
KAITO + InferenceSet CRD for LLM ops
TTFT · ITL · KV cache % · queue depth
Disaggregated prefill/decode (llm-d, IGW)
MCP agent as intelligent outer loop
NFS weight cache → cold start ~30s
KV-cache-aware routing in Inference GW
The 2026 Complete Picture
The stack in 2026 isn't a replacement of 2024's work — it's an extension. KEDA and Karpenter are still the right primitives for the fast pod and node loops. The economic engine still handles holistic cross-layer optimization. What's been added on top is a new tier: an LLM agent with MCP access to the full operational context, acting as a reasoning layer for situations the rule-based systems can't handle, and a new substrate of LLM-specific tooling — KAITO, disaggregated serving, KV-cache-aware routing — that makes the fast loops actually work for inference workloads.
If I had to name the one thing that has genuinely changed the problem space rather than just improved the tooling: it's the recognition that scaling LLM inference and running LLMs for scaling are now the same problem. The agent that manages your cluster is itself a model that needs to be served and scaled. That recursive quality is new, and the field is only starting to work out what it implies.
Written February 18, 2026. All references current as of February 18, 2026. Sources include: llm-d (Red Hat/IBM, 2025), KAITO v0.8.0 AKS Engineering Blog (Feb 2026), NVIDIA NIM RAG autoscaling (Dec 2025), Red Hat OpenShift AI vLLM autoscaling (Nov 2025), MCP Linux Foundation donation (Dec 2025). Opinions are my own.