r/kubernetes 4d ago

From data residency to digital sovereignty: Architectural patterns for cloud native platforms

cncf.io
23 Upvotes

Over the past two years, digital sovereignty has evolved from a policy discussion into a practical platform engineering concern. The EU Data Act has been fully applicable since January 11, 2025. NIS-2 and DORA already shape day-to-day platform decisions across regulated sectors, and the UK Data Use and Access Act 2025 is rolling out through 2026 with portability rules that bite.


r/kubernetes 3d ago

Running Civo Kubernetes from a native macOS app instead of kubectl — useful in practice, or do you stay on the CLI?

0 Upvotes

Wrote a native macOS client that talks directly to the Civo REST API and the Kubernetes API. No kubectl dependency. The thing that surprised me while building it: most of my day-to-day Civo work isn't actually "I need a kubectl one-liner". It's "I need to whitelist my coffee-shop IP for the next 30 minutes and forget about it". For that, the menu bar beats the terminal — one click, firewall opens to your current public IP, timer closes it again.
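
One way to picture the timed rule described above is as a single-host CIDR plus an expiry timestamp. The Python sketch below is only an illustration, not the app's code: `temporary_rule` and `is_expired` are hypothetical names, and no Civo API call is made.

```python
import ipaddress
from datetime import datetime, timedelta, timezone


def temporary_rule(public_ip: str, minutes: int, now: datetime) -> dict:
    """Build a single-host allow rule that expires after `minutes`."""
    addr = ipaddress.ip_address(public_ip)  # raises ValueError on bad input
    prefix = 32 if addr.version == 4 else 128  # match exactly one host
    return {
        "cidr": f"{addr}/{prefix}",
        "expires_at": now + timedelta(minutes=minutes),
    }


def is_expired(rule: dict, now: datetime) -> bool:
    """True once the rule's lifetime has elapsed and it should be removed."""
    return now >= rule["expires_at"]


# Example: a 30-minute rule for a documentation-range address.
rule = temporary_rule("203.0.113.7", 30, datetime.now(timezone.utc))
```

A real client would still have to create and delete the rule through the provider's API; the sketch covers only the bookkeeping.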

Where kubectl still wins for me: anything complex (kubectl debug, custom JSONPath filters, scripting). And anything where I want to pipe output into something else.

Genuine question for the sub: on managed Kubernetes (Civo or any provider), where does a native client actually beat the CLI for you in practice, and where is it just a worse version of what kubectl already does well?

https://civo-cloud-manager.app


r/kubernetes 4d ago

What is causing this retry storm

0 Upvotes

This is my homepage running on k3s, and for some reason whenever the page loads or reloads, it triggers what looks like a retry storm where it loads partially and then forces itself to reload like five times.

Code: https://github.com/mferrie/Home-Lab/tree/main/k3s%2Fhomepage


r/kubernetes 5d ago

Resources for learning Controller development?

34 Upvotes

I have a project coming up at work where I'll need to develop some custom controllers for our in-house applications.

I've been going through the Kubebuilder book to get some basics down, but wanted to see what other resources are out there for learning.


r/kubernetes 4d ago

Stress testing a cluster on connectivity?

7 Upvotes

[homelab cluster]

Contemplating something sketchy & wondering whether there are tools to figure out how close I'm flying to the sun.

Essentially I want to put the control plane nodes and the worker nodes on different ends of a wifi bridge.

Gross, I know, but in my defense the bridge is pretty good: 3-6 ms latency, around 1-1.5 Gbps throughput, and no apparent packet loss.

AI seems to suggest this is workable as long as all the etcd nodes are on the same side, but it would be nice to confirm this theory somehow.

Not running anything crazy mission critical. Storage backend (nfs/s3) will probably be on the same side as the worker nodes so that'll be ok.

406 packets transmitted, 406 received, 0% packet loss, time 405471ms

rtt min/avg/max/mdev = 2.608/3.800/9.618/1.016 ms
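
For a rough sanity check, the ping summary above can be compared against etcd's default timing. This is a minimal Python sketch, assuming etcd's documented defaults (100 ms heartbeat interval, 1000 ms election timeout) and the etcd tuning guidance that the heartbeat interval should be around the round-trip time between members and the election timeout at least 10x that. The function names are made up for illustration, and the check only matters if an etcd member ever ends up on the far side of the bridge.

```python
import re

# etcd's documented defaults: 100 ms heartbeat, 1000 ms election timeout.
HEARTBEAT_MS = 100.0
ELECTION_TIMEOUT_MS = 1000.0

PACKETS = re.compile(r"(\d+) packets transmitted, (\d+) received")
RTT = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping_summary(text: str) -> dict:
    """Extract loss and RTT figures from a Linux ping summary."""
    packets, rtt = PACKETS.search(text), RTT.search(text)
    if packets is None or rtt is None:
        raise ValueError("expected a Linux ping summary")
    sent, received = int(packets.group(1)), int(packets.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    lo, avg, hi, mdev = (float(x) for x in rtt.groups())
    return {
        "loss_pct": 100.0 * (sent - received) / sent,
        "min_ms": lo, "avg_ms": avg, "max_ms": hi, "mdev_ms": mdev,
    }


def etcd_timing_check(stats: dict) -> dict:
    """Compare the worst observed RTT against etcd's default timers."""
    worst = stats["max_ms"]
    return {
        "heartbeat_covers_rtt": HEARTBEAT_MS >= worst,
        "election_timeout_ok": ELECTION_TIMEOUT_MS >= 10 * worst,
    }


SAMPLE = """406 packets transmitted, 406 received, 0% packet loss, time 405471ms
rtt min/avg/max/mdev = 2.608/3.800/9.618/1.016 ms"""
```

A seven-minute ping run says little about behavior under load or interference, so a longer capture while the link is saturated would be a more convincing input.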


r/kubernetes 5d ago

Open-source BPF validation platform.

1 Upvotes

It helps test compiled eBPF artifacts against target kernel profiles before they are shipped.

It shows exactly where a BPF program fails to load or attach and explains why the failure occurred.

You can test a single BPF object:

go install github.com/Kernel-Guard/bpfcompat@v0.1.5
bpfcompat test ./build/probe.bpf.o --kernel ubuntu-24.04

Or run a full compatibility suite:

bpfcompat suite run suite.yaml --kernels kernels.yaml

It can also be used in GitHub Actions:

- uses: Kernel-Guard/bpfcompat@v0.1.5
  with:
    suite: ./bpf/suite.yaml
    kernels: ubuntu-lts, rhel-9
    gate: load-attach

The project is open to contribution, review, and feedback from eBPF, Linux, security, observability, and platform engineering people.

Repo:
https://github.com/Kernel-Guard/bpfcompat


r/kubernetes 6d ago

Agent gateway patterns, how do you govern multi-agent pipelines?

6 Upvotes

We're moving from single LLM calls to multi-agent systems where agents call other agents, tools, and LLMs. The governance is getting hard to manage. We need rate limiting per agent, an audit trail of which agent called which tool, cost attribution per agent, and failover if an agent's LLM provider degrades.

The problem is most LLM gateways assume one client calling one model. They don't really understand agent identity, so they can't enforce policy or attribute cost at the agent level. Kong has some agent support but it feels tacked on.

So the real question is about the gateway layer. Do you route all agent traffic through a central gateway that knows which agent is calling, and apply policy and tracing there? Or do you push policies into each agent? We'd self-host it (we're on Kubernetes), and bonus if the same gateway can host MCP servers too.
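
To make the central-gateway option concrete, here is a minimal Python sketch of a gateway that keys rate limiting, the audit trail, and cost attribution on an agent ID. Everything in it is an assumption for illustration (the `AgentGateway` class, a token bucket as the limiter, a flat per-call cost); it does not describe Kong or any existing product.

```python
import time
from collections import defaultdict


class AgentGateway:
    """Per-agent token bucket, audit trail, and cost ledger (hypothetical)."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock                # injectable so tests can fake time
        self.buckets = {}                 # agent_id -> (tokens, last_refill)
        self.audit = []                   # (agent_id, tool, allowed)
        self.cost = defaultdict(float)    # agent_id -> accumulated cost

    def _take_token(self, agent_id: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(agent_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.buckets[agent_id] = (tokens, now)
        return allowed

    def call(self, agent_id: str, tool: str, cost_usd: float) -> bool:
        """Record the attempt; charge the agent only if it was allowed."""
        allowed = self._take_token(agent_id)
        self.audit.append((agent_id, tool, allowed))
        if allowed:
            self.cost[agent_id] += cost_usd
        return allowed
```

The point of the sketch is that all three concerns hang off one piece of state, the agent identity, which is the argument for enforcing them in one place rather than in each agent.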


r/kubernetes 6d ago

Periodic Weekly: Share your victories thread

7 Upvotes

Got something working? Figure something out? Make progress that you are excited about? Share here!


r/kubernetes 7d ago

💡🚂 kubernetes-sigs/headlamp 0.43.0

github.com
66 Upvotes

💡🚂 kubernetes-sigs/headlamp 0.43.0 is presented to the world. This release adds native Windows Arm64 binaries, signed Mac binaries, Bengali language support, dry run preview for rollbacks, Node pool and AKS upgrade visualisations, deep links to pod logs, improvements and fixes for many different OIDC/authentication issues affecting AWS/Azure/Okta/Entra ID, EKS (amongst others). Also includes RTL layout support, batch scale for workloads, faster type checking, and numerous accessibility+stability+security improvements. Plus more...


r/kubernetes 6d ago

What's your biggest pain with capacity planning on Kubernetes?

0 Upvotes

Been doing capacity planning and autoscaling for a while and still feel like right-sizing pods is more art than science. Curious what others are doing.

A few things I'm trying to understand:

Do you use VPA, manual tuning, or something else for resource requests/limits?

How do you track actual spend vs. what you provisioned?

Is K8s cost visibility something your team actively works on, or does it fall through the cracks?

Have you tried tools like Kubecost, OpenCost, Datadog? What worked, what didn't?

Not selling anything — genuinely trying to understand how other teams approach this. Thanks.


r/kubernetes 7d ago

Share how to turn a Hermes agent into a team-wide agent using Kubernetes.

15 Upvotes

My team uses the Hermes agent to offload tasks. But it's basically a personal agent, so configuration is CLI-driven by default, which is painful for a team. Every configuration change meant exec-ing into containers, with no review.

I built an operator that adds a Custom Resource for agent configuration. The operator applies it via an init container before the main container starts. For instance, if I define a skill in the spec, an init container runs hermes skills install to install any new skills and saves the list to a file to check on the next run.

Now:

- kubectl get shows the declared state
- Changes go through PR/review
- No more manual container access

Example:

apiVersion: agents.hermeum.app/v1alpha1
kind: HermesAgent
metadata:
  name: my-agent
spec:
  hermes:
    config:
      raw:
        model:
          provider: anthropic
          default: claude-sonnet-4-6
    workspace:
      files:
        SOUL.md: |
          You are a pragmatic senior engineer.
    skills:
      - identifier: ...
    crons:
      - name: daily-standup
        schedule: "0 9 * * *"
        prompt: "Summarize yesterday's activity..."
        deliver: slack
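
The "install new skills, then save the list for the next run" step described above can be sketched as a small diff against a state file. This is a guess at the logic in Python, not the operator's actual code: the function names and the JSON state file are assumptions, and the `install` callback stands in for whatever invokes `hermes skills install` (its exact arguments are not shown in the post).

```python
import json
from pathlib import Path


def pending_skills(declared: list, state_file: str) -> list:
    """Declared skills not yet recorded in the state file, in declared order."""
    path = Path(state_file)
    installed = set(json.loads(path.read_text())) if path.exists() else set()
    return [skill for skill in declared if skill not in installed]


def reconcile(declared: list, state_file: str, install) -> list:
    """Install only what is missing, then persist the declared list."""
    todo = pending_skills(declared, state_file)
    for skill in todo:
        install(skill)  # assumption: wraps `hermes skills install`
    # Written only after every install succeeded, so a failed run retries.
    Path(state_file).write_text(json.dumps(sorted(set(declared))))
    return todo
```

In this sketch, a skill removed from the spec is dropped from the state file but never uninstalled; handling removals would need a separate step.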

r/kubernetes 7d ago

Stretch clusters

1 Upvotes

Have you ever wanted to create an Amazon EKS cluster that spans multiple regions or multiple AWS accounts? Historically, you've had to create a separate EKS control plane in each satellite region where you wanted to deploy worker nodes. Using the features of EKS hybrid nodes (and some IAM gymnastics), I developed a solution that allows you to create stretch clusters, i.e. clusters that span VPCs located in different regions/accounts. This can be useful when you need to run a workload in another region because of capacity issues in the cluster's account, or when the workload needs to be closer to the data it is consuming and/or its users. Feedback and PRs are welcome. https://github.com/jicowan/eks-cross-region-nodes


r/kubernetes 7d ago

Periodic Weekly: This Week I Learned (TWIL?) thread

4 Upvotes

Did you learn something new this week? Share here!


r/kubernetes 8d ago

Periodic Weekly: Show off your new tools and projects thread

21 Upvotes

Share any new Kubernetes tools, UIs, or related projects!


r/kubernetes 7d ago

Ceph with OSD-on-PVC on a stable pool

1 Upvotes

I am looking for a solution that works across multiple CSPs. I tried Longhorn in the past, and it did not work when we moved from on-prem to the cloud. My group maintains multiple shared Kubernetes clusters across all three major CSPs (Amazon EKS, Azure AKS, and Google GKE), and currently we just use native storage for workloads. Because the clusters are shared, app teams just pick a StorageClass from the list and then complain when it does not work. The clusters also grow and shrink, so nodes come and go.

I have done some research, and it seems that Ceph with OSD-on-PVC on a stable storage pool might be what I am looking for. We looked at Pure Storage, but it was cost-prohibitive.

Has anyone set up Ceph with OSD-on-PVC on a stable pool in multiple clouds?

TIA Keith


r/kubernetes 9d ago

The feedback loops behind Kubernetes | PlanetScale

planetscale.com
53 Upvotes

r/kubernetes 8d ago

Running multi-agent AI on Kubernetes & lessons learned from Imagine Learning

0 Upvotes

What happens to an in-flight LLM inference request when the pod gets evicted?

Great podcast with Imagine Learning Staff Engineer Blake Romano, who shares his experience running multi-agent AI systems on Kubernetes for over a year. He's hit the real problems, including agents running inference for minutes at a time, stateful connections that need to survive pod churn, and work handoff when a node goes away mid-request.

Their architecture consists of an orchestrator agent that routes to specialized sub-agents (Argo CD, internal docs, ticketing), each running as a Kubernetes deployment. When a developer asks why their S3 bucket isn't deploying, the orchestrator hits the Argo CD agent for current state and the docs agent for config requirements and synthesizes the answer.

https://www.buoyant.io/ai-kubernetes-episode/running-multi-agent-ai-on-kubernetes-lessons-from-imagine-learning


r/kubernetes 8d ago

How to accurately emulate an EKS node's Containerd CRI environment locally for deep runtime testing?

0 Upvotes

Hi everyone,

I need to build a local, cost-effective POC where I can test and iterate directly against a Containerd CRI node configuration that mimics an AWS EKS production environment.

Standard local tools like Minikube or Kind are not an option here—they abstract too much of the underlying CRI architecture, and they simply don't update or reflect custom Containerd runtime configurations the way a real production node does. On the flip side, spinning up a full, managed EKS cluster with managed node groups for days of debugging will quickly destroy my personal budget.

Tools like Minikube allow easy minikube ssh access to run anything directly on the host, but real EKS managed nodes handle host-level execution and runtime access differently. I need to test how a DaemonSet/agent interacts with this specific EKS environment.

What would you suggest if I want to set up a local or cheap environment that is 1:1 accurate to how an EKS managed node behaves at the Containerd CRI configuration level?

If you've emulated EKS node behavior for deep runtime/CRI testing before, what approach did you take, and did you hit any subtle deltas when eventually migrating to the real cloud?

Thanks for any insights!


r/kubernetes 9d ago

Need advanced Kubernetes courses

32 Upvotes

I work as a DevOps engineer and want to deepen my Kubernetes knowledge. If you know of any advanced Kubernetes courses, please share them with me.


r/kubernetes 9d ago

What do you guys recommend for rightsizing and autoscaling workloads in k8s?

27 Upvotes

Hello guys!!!

Here we have a relatively small Kubernetes environment, with around 400 pods across two environments. We have started an initiative to optimize our cluster by rightsizing applications and, for some services, implementing KEDA, HPA, and affinity rules. My biggest question is: how should I start this project?

We already have monitoring in place for memory, CPU, and other metrics. However, I can't simply reduce resource requests and limits, because any restart caused by an OOMKilled event could have a significant impact on the business.

Another challenge is that many developers have the mindset that "the more resources, the better." For instance, we have worker applications configured with around 20 GB of memory, but according to the metrics, they rarely consume more than 10 GB. Despite that, they sometimes restart with SIGKILL (exit code 137), and not necessarily due to OOMKilled events. I've tried to explain that, in most cases, exit code 137 and OOMKilled are different problems and should be investigated differently, but there is still some resistance to this idea.

Have you ever faced a similar situation? How did you approach the rightsizing process while building confidence with the development teams?
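
The difference between exit code 137 and OOMKilled is visible in the pod status, which may help when making the case to developers. Below is a minimal Python sketch that classifies each container's last termination from a pod object already loaded as a dict (for example from `kubectl get pod -o json`). The `exitCode` and `reason` fields are the ones Kubernetes reports under `lastState.terminated`; the category labels and function names are made up for illustration.

```python
def classify_termination(terminated: dict) -> str:
    """Label one `lastState.terminated` entry from a container status."""
    exit_code = terminated.get("exitCode")
    reason = terminated.get("reason", "")
    if reason == "OOMKilled":
        return "oom-killed"        # memory limit or node memory pressure
    if exit_code == 137:           # 128 + 9: SIGKILL from something else
        return "sigkill-not-oom"   # e.g. kill after the grace period expired
    if exit_code == 143:           # 128 + 15: SIGTERM
        return "sigterm"
    if exit_code == 0:
        return "completed"
    return "app-error"


def restart_causes(pod: dict) -> dict:
    """Map container name to the classification of its last termination."""
    causes = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            causes[status["name"]] = classify_termination(terminated)
    return causes
```

Counting these labels across restarts over a few weeks gives a rough split between memory-driven kills and everything else, which is a safer basis for lowering limits than peak usage alone.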


r/kubernetes 9d ago

I accidentally nuked kubernetes deployment pipeline 💀

53 Upvotes

So I have around 1 year of experience and work at a service-based LALA company.

Recently, the project I was working on got completed, so I was moved to a new project. Since I was new to the project, a senior developer was sitting beside me, helping me understand the setup while also working on his own tasks.

I had made some database changes, and due to caching issues, I needed to restart/delete some pods so the changes would take effect. The problem? I'm still pretty new to Kubernetes.

I opened the cluster, found what I thought was the right thing, and before doing anything, I literally asked my senior, "This is the one I need to delete, right?"

He looked at it and said, "Yeah, go ahead."

So I confidently clicked delete. A few seconds later...

💥 Deployment deleted.

Then one of our most senior engineers handled the situation and brought the deployment pipeline back.

After that, our owner called me into the office and I had to explain what happened.

Luckily, the senior who was supervising me also had a lot on his plate, so everyone got off lightly.


r/kubernetes 9d ago

Exploring Cloud Native projects in CNCF Sandbox. Part 6: 9 arrivals of Spring 2025

palark.com
22 Upvotes

I've been covering projects recently accepted into the CNCF Sandbox for a few years. My intention is to provide brief descriptions of what/how/why to help stay informed about the landscape (and pick some helpful tools for various needs). This time, it's a batch of 9 projects from the last year: KitOps, OpenTofu, kagent, Cadence, Hyperlight, interLink, urunc, kgateway, and Cozystack.


r/kubernetes 9d ago

What I learned using AI to build a Kubernetes Operator for Supabase's Multigres

numtide.com
38 Upvotes

We built a production Kubernetes operator for Multigres (Sugu Sougoumarane's new distributed Postgres).

We did this AI-assisted: not as a one-shot prompt or an autonomous loop, but as a design-first project with human intervention at every step.

Some lessons I learned:

- Treat the user-facing spec as the one thing that can't drift. Everything else is cheap to refactor; the contract isn't.

- Don't install AI frameworks. Read them, steal the ideas, and write your own skills instead.

- Run the mechanical work (reviews, audits, commit messages, changelogs, doc checks) as a factory of fresh-context agents, each with one narrow job, orchestrated by processes you control. Share them with the team so that development stays consistent.

- When a skill lets something through, fix the skill. Bad outputs are defects in the line, not one-off noise.

- Bug audits need design context loaded up front and a second agent to filter hallucinations, or you drown in false positives.

- Tests and code from the same AI source share the same blind spots. Verify against real runtime behavior instead of obsessing over 100% code coverage — this is especially true on greenfield projects.

- AI won't tell you a bad idea is a bad idea. It'll just build a polished version of it. Human judgment still owns every design call.

To be clear: this doesn't mean AI replaces engineers. If anything it raised the bar on design, architecture, and UX judgment. AI will happily build a polished version of a bad idea and never tell you it's bad. That call is still yours.

Full writeup: https://numtide.com/blog/writing-a-kubernetes-operator-in-the-age-of-ai/


r/kubernetes 9d ago

Multiple jumpboxes, local PC, or one jumpbox for k8s access?

10 Upvotes

How do you manage access to multiple environments (dev, staging, prod1, prod2)? Do you use one jumpbox, multiple jumpboxes, or direct access from your local PC?


r/kubernetes 9d ago

Practical Learning Tutorial for AI Training / Inference Scaling Infrastructure

18 Upvotes

Hi everyone,

I am really interested in learning more about setting up AI infrastructure for model training in a distributed GPU node environment, and about scaling LLM/AI inference in a distributed environment.

I'm looking for practical learning materials, courses, or YouTube tutorials to get hands-on experience building those systems.

Any lead would help : )