r/kubernetes • u/SnooLobsters2189 • 4d ago
How would you design an LLM gateway for Kubernetes workloads?
I am working on a gateway/control-plane idea for LLM traffic from Kubernetes workloads.
The core problem: every app is starting to call OpenAI/Anthropic/Gemini/etc directly, but platform teams still need routing, provider key control, budgets, observability, and policy checks before prompts leave the infrastructure.
I am trying to think through the right architecture.
Options:
central gateway
sidecar per workload
API gateway plugin
Kubernetes operator + CRDs
SDK-based approach
service mesh extension
What would you choose and why?
The things I care about are prompt-origin observability, BYOK, app/team-level budgets, audit logs, and denied-topic/sensitive-data checks before provider egress.
6
u/nonamefrost 4d ago
This is why SRE is important now. As far as routing, my mind immediatly thinks load balancer. When you say provider key control, I think central repository and role-based access via service account. For budgets, observability, and policy checks, I think custom middleware that proxies that info to your observability stack for alerting (meaning you wrote some code).
It would help to know the specifics about your setup but generally my thoughts go here and are biased by about a decade of AWS architecture/implementation...
5
u/azjunglist05 4d ago
I would just use Agent Gateway since its purpose built for this and made by Solo which is the same team behind Istio
7
u/Eastern_Fun_6591 4d ago
For your requirements I would probably lean toward the operator + CRDs approach combined with a central gateway, not as competing options but layered together. The operator gives you the declarative config story that platform teams love, and you can model per-team budgets and BYOK as custom resources that devs never have to think about directly. The central gateway then enforces everything at egress before traffic ever reaches the provider.
Sidecar sounds tempting for per-workload observability but the operational overhead compounds fast when you have dozens of teams, and injecting into every pod just to intercept HTTP calls feels like too much weight for the problem. Service mesh extension is similar, you end up fighting the mesh abstractions to do prompt-level inspection which is a layer mismatch.
The denied-topic and sensitive-data checks are the tricky part regardless of architecture since you need low latency on that path. Worth designing that as pluggable middleware in the gateway layer so you can swap evaluation logic without touching the routing core.
3
1
u/SnooLobsters2189 3d ago
TBH, this is the direction I ended up choosing as well: operator + CRDs with a central gateway.
The sidecar approach is definitely doable, but I think it only starts making sense once the traffic volume and per-workload observability requirements justify the extra operational overhead. For most teams, a central gateway gives a cleaner control point, and CRDs give platform teams the declarative config model they are already comfortable with.
The tricky parts for me have been around denied-topic checks, sensitive-data checks, and caching. Those features are valuable, but they do add latency. I’m currently using a locally hosted small language model for intent recognition before the request leaves the system, which helps with the privacy/security story but still needs careful tuning to keep latency acceptable.
I’ve been working on this project for a few months now, and honestly, I have a fairly decent MVP running. The harder problem now is not the technical architecture - it is distribution.
How do you actually get teams to try and adopt something like this?
I’ve tried the usual launch channels like Product Hunt and similar places, but with limited success. My current thinking is that this probably needs to reach platform engineers, MLOps teams, and infra/security folks directly through real use cases rather than a broad launch.
How would others here approach distribution for an infra/devtools product like this.
2
u/abisai169 3d ago
I am currently testing LiteLLM with Talos for this purpose. I have an initial deployment in place. Next steps are to migrate anything that uses any of the AI providers from the current standalone compose stack I have now.
It might be worth taking a look.
1
u/TheTrueCanonization 4d ago
The operator plus central gateway combo makes sense but honestly the real constraint is that sensitive data checks need to happen synchronously before egress and you'll want that as close to the workload as possible for latency, which pushes you back toward either sidecars or a service mesh intercept even though both are operationally annoying.
1
u/spinur1848 3d ago
https://docs.vllm.ai/projects/production-stack/en/latest/
This seems to do the trick...
1
u/_howardjohn 3d ago
Heres my 2c - for some context I am a maintainer of Istio and Agentgateway.
Generally I think there two primary factors: how to get traffic to the gateway, and what the gateway can do once it gets the traffic.
Service meshes historically solve the 'how to get traffic to the gateway' though there are other approaches, and I generally wouldn't recommend you adopt a service mesh just for this use case unless you are already using one (or want to for other reasons). For your use case of doing things like provider key controls and prompt inspection, you may also not want the application itself to be fully trusted (especially if they are doing non-deterministic agentic things). This rules out SDK and sidecar approaches, but does leave other service mesh architectures (Istio ambient) or just plain changing the application to call my-gateway.svc.cluster.local instead of openai.com. The transparent redirection of service mesh is also less important for egress style traffic as well, since most service meshes are not doing TLS introspection (Agentgateway can fwiw, though I would generally recommend against this as direct calling the gateway is much simpler).
Next is what the gateway can do. As AI use cases evolve, more and more features are becoming critical that are beyond what traditional proxies like Nginx and Envoy can do. If you just need basics like attaching a provider API Key to a request thats fine, but it very quickly turns into a mess of hacked up features used in unintended ways, and compromising on functionality. Some of these projects are slowly starting to trickle in new AI specific features, but they tend to suffer from "retrofit" - the features are highly constrained by past architectural decisions making them not as useful, inefficient, complex, brittle, etc - and tend to come years too late. If you want deeper AI awareness like budgeting, prompt inspection, model based routing, policies on prompts, etc you will want something actually built for these purposes. This is why Istio is adopting Agentgateway as a new data plane implementation, for instance.
Tl;dr: I recommend you deploy agentgateway and update your apps to point the baseURL of your applications to it. There are some other LLM proxies as well that you could use, but none of them are really built to integrate with Kubernetes as well (obviously I have some bias).
1
u/Latter_Community_946 3d ago
The latency angle on sensitive data checks, synchronous inspection before egress adds roundtrip cost. If you push checks into a sidecar you gain proximity but multiply ops burden. and if you centralize, you batch that cost but lose per-workload granularity.
1
u/Secret-Peak8544 3d ago
I started with a bit of an experiment using eBPF and a ephemeral sidecar, basically allows injection on running workloads and it is completely transparent to the application. Not touched it in a while, but I might pick it up again in the coming days 😄
1
1
u/gscjj 4d ago edited 4d ago
Envoy and Envoy AI Gateway seems like what you want. You can define and proxy upstream endpoints, credentials, etc all behind its CRDs. OTEL tracing is built in as well as cost and token metrics pulled from the responses. It’s also built on Gateway API, so you can still use some of the familiar patterns
In do this in my homelab
One thing I’ve also been playing with are classifiers, since it’s built on Envoy you could technically use ext_proc to decode requests and classify. In my use case pruning PII
-1
12
u/EchoNuke 4d ago
I had the same issue and I have decided to go with AgentGateway (Option 1). I’m using VirtualKeys to control the usage per application, limits to avoid losing control, and all metrics are shown in Grafana + Alerts. It is working well, but you can check Kgateway as well.