Kubernetes in production: complete guide for teams running critical systems

Running Kubernetes in production is not the same discipline as running a development cluster, in the same way that maintaining a commercial aircraft in flight is not the same as ground testing. In a development environment, you can afford broad permissions, no resource limits, no NetworkPolicies, and convenience credentials. In production, each of those decisions becomes compounded risk. That risk is not theoretical: it shows up as real incidents, opaque costs, and change windows that shrink as accumulated debt grows.

The pattern we see most often in organizations one or two years into Kubernetes is a cluster that works, but not in a governed way. The typical symptoms are pods without resource limits, namespaces with implicit network rules, static secrets in environment variables, and upgrades that have been deferred for months. The system holds together until load spikes, an audit question surfaces, or an unplanned change touches something nobody documented.

This guide aggregates the concepts and articles that, taken together, give you the tools to operate Kubernetes with genuine control: security by default, data-driven scaling, actionable observability, pragmatic networking, and a delivery model that does not depend on tribal memory.

What this guide covers

The articles below cover specific areas with technical depth. This guide acts as an entry map and cross-cutting complement.

Migrating Ingress NGINX after retirement: guide and alternatives: real options for teams in transition who need to decide without letting urgency dictate architecture
EKS Auto Mode vs Karpenter: which is better for your cluster: intelligent scaling on AWS, when each option pays off and what concrete tradeoffs each implies
DevOps trends 2026: platform engineering, AI, and FinOps: ecosystem context to understand what is changing and what has real traction beyond the hype cycle
GitOps with ArgoCD: benefits and best practices for teams: declarative, reversible delivery with continuous reconciliation, and why that model matters more than the YAML repository
Seamless cloud migration: how to do it right: phased migration pattern that preserves availability and allows rollback if something does not go as planned

Security and governance in Kubernetes

Security in Kubernetes fails systematically for two reasons that are rarely discussed together. The first is that excessive permissions accumulate silently. The second is that the silence ends during an incident.

The Kubernetes identity model is built on RBAC, but RBAC without discipline degrades quickly. What starts as a ClusterRole with broad permissions "so the team can iterate fast" stays there for months, and with it stays an attack surface nobody reviews. In practice, the principle of least privilege is not about reducing permissions until the system breaks: it is about being deliberate about what each service account can do and reviewing that decision on a regular cadence.
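A regular review of broad permissions can start very small. The sketch below flags ClusterRole rules that use `*` in verbs or resources. It assumes the role has already been exported as JSON (for example with `kubectl get clusterrole <name> -o json`); the role name and its rules are hypothetical.

```python
def wildcard_rules(cluster_role: dict) -> list[dict]:
    """Return the rules of a ClusterRole that use '*' in verbs or resources."""
    flagged = []
    for rule in cluster_role.get("rules") or []:
        verbs = rule.get("verbs") or []
        resources = rule.get("resources") or []
        if "*" in verbs or "*" in resources:
            flagged.append(rule)
    return flagged


# Hypothetical ClusterRole, shaped like the JSON that kubectl returns.
team_role = {
    "kind": "ClusterRole",
    "metadata": {"name": "team-iterate-fast"},
    "rules": [
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
        {"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]},
    ],
}

for rule in wildcard_rules(team_role):
    print(f"{team_role['metadata']['name']}: wildcard rule {rule}")
```

A flagged rule is not automatically wrong. The point is that each one should have a justification and an owner.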

Network Policies deserve a specific call-out because their absence generates no visible errors, only latent risk. Without network policies, the default behavior is that any pod can reach any other. That means a local compromise can become lateral movement with minimal effort. It also means that a misconfigured service with a retry loop or loss of control over outgoing connections can degrade the rest of the cluster in ways that are hard to attribute. Introducing NetworkPolicies has a real upfront cost: it forces you to map communication flows that tend to be implicit. That cost buys containment.
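As a minimal sketch of the containment described above, the code below builds a default-deny NetworkPolicy for a namespace and lists namespaces that have no policy at all. Kubernetes accepts JSON manifests as well as YAML, so the dict can be dumped and applied as is. The namespace names and helper function names are hypothetical.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Both types listed with no ingress or egress rules: nothing is allowed.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def namespaces_without_policy(namespaces: list[str], policies: list[dict]) -> list[str]:
    """Return the namespaces that have no NetworkPolicy applied."""
    covered = {policy["metadata"]["namespace"] for policy in policies}
    return sorted(ns for ns in namespaces if ns not in covered)


policies = [default_deny_policy("payments")]
print(namespaces_without_policy(["payments", "checkout"], policies))
print(json.dumps(default_deny_policy("checkout"), indent=2))
```

Two caveats apply. Denying egress also blocks DNS lookups, so explicit allow rules have to follow. And NetworkPolicies only take effect when the cluster's network plugin enforces them.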

Pod Security is the third pillar. The retirement of PodSecurityPolicy and the arrival of Pod Security Admission, alongside projects like Kyverno, changed the operational model but did not change the underlying problem: pods with excessive capabilities, mounting hostPath, or running as root are privilege escalation vectors. Applying restricted or baseline profiles by default and requiring explicit, reviewed exceptions is not bureaucracy: it is defensive engineering.
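To make the escalation vectors above concrete, here is a coarse check over a pod spec. It is a deliberately small subset and not an implementation of the Pod Security Standards: a real cluster should rely on Pod Security Admission or a policy engine. The function name and sample specs are hypothetical.

```python
def pod_security_findings(pod_spec: dict) -> list[str]:
    """Flag hostPath volumes, privileged containers, possible root, added capabilities."""
    findings = []
    for volume in pod_spec.get("volumes") or []:
        if "hostPath" in volume:
            findings.append(f"volume {volume.get('name')} mounts hostPath")
    pod_ctx = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers") or []:
        ctx = container.get("securityContext") or {}
        name = container.get("name")
        if ctx.get("privileged"):
            findings.append(f"container {name} is privileged")
        # Container-level runAsNonRoot takes precedence over the pod-level value.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        if not non_root:
            findings.append(f"container {name} may run as root")
        added = (ctx.get("capabilities") or {}).get("add") or []
        if added:
            findings.append(f"container {name} adds capabilities {added}")
    return findings


risky = {
    "volumes": [{"name": "host-logs", "hostPath": {"path": "/var/log"}}],
    "containers": [{"name": "agent", "securityContext": {"privileged": True}}],
}
hardened = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [
        {"name": "api", "securityContext": {"capabilities": {"drop": ["ALL"]}}}
    ],
}

for finding in pod_security_findings(risky):
    print(finding)
print(pod_security_findings(hardened))
```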

The supply chain closes the triangle. Scanning images in CI with Trivy or Grype and blocking deployments with critical unpatched CVEs reduces surface area effectively. The gap we see frequently is that scans exist but blocking policies do not. Detecting a vulnerability and deploying it anyway provides no value. Useful detection is the kind that interrupts the flow when risk exceeds an agreed threshold.
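The gap between detecting and blocking can be closed with a few lines in CI. The sketch below reads a scan report and decides whether the pipeline should stop. The report shape is loosely modelled on Trivy's JSON output, and the field names (`Results`, `Vulnerabilities`, `Severity`, `VulnerabilityID`) are assumptions to verify against the scanner version you run. The CVE identifiers are placeholders.

```python
SEVERITY_RANK = {"UNKNOWN": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def blocking_findings(report: dict, threshold: str = "CRITICAL") -> list[str]:
    """Return the IDs of vulnerabilities at or above the agreed severity threshold."""
    minimum = SEVERITY_RANK[threshold]
    blocked = []
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity", "UNKNOWN")
            if SEVERITY_RANK.get(severity, 0) >= minimum:
                blocked.append(vuln["VulnerabilityID"])
    return blocked


# Placeholder report: the IDs are not real CVEs.
sample_report = {
    "Results": [
        {
            "Target": "shop/api:1.4.2",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "HIGH"},
            ],
        }
    ]
}

found = blocking_findings(sample_report)
# In a pipeline, this value would be passed to sys.exit() so the job fails.
exit_code = 1 if found else 0
print(f"exit code {exit_code}, blocking findings: {found}")
```

The threshold is the agreed part. Moving it from `CRITICAL` to `HIGH` is a policy decision, not a tooling one.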

Governance completes the model. Admission webhooks, OPA Gatekeeper, or Kyverno as declarative policy engines allow you to codify security decisions as versioned policy. That has an important practical consequence: exceptions become visible, auditable, and reversible. Instead of a privileged pod that "nobody knows why it's like that," you have a documented exception in Git with a justification and an owner.

Scaling and resources

Scaling in Kubernetes is typically discussed as a tooling choice: HPA or VPA, Cluster Autoscaler or Karpenter, CPU or custom metrics. Those choices matter, but the prior question matters more: do your workloads have requests and limits declared with enough precision for the scheduler to make correct decisions?

Without realistic requests, the scheduler assigns pods to nodes based on incorrect information. The result is overcommitted nodes that under pressure produce CPU throttling or memory OOMKills. Both symptoms look like application instability. OOMKills especially, because they imply cache loss, reconnections, and warm-up time that in critical services translate into additional latency someone attributes to "the application" rather than resource configuration.
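Before tuning any autoscaler, it helps to know which workloads declare nothing. This sketch lists the missing `requests` and `limits` entries for each container in a pod spec. The function name and sample spec are hypothetical, and it only checks presence, not whether the values match measured consumption.

```python
def missing_resources(pod_spec: dict) -> list[str]:
    """List the cpu/memory requests and limits that each container fails to declare."""
    problems = []
    for container in pod_spec.get("containers") or []:
        resources = container.get("resources") or {}
        name = container.get("name")
        for section in ("requests", "limits"):
            declared = resources.get(section) or {}
            for key in ("cpu", "memory"):
                if key not in declared:
                    problems.append(f"{name}: missing {section}.{key}")
    return problems


pod_spec = {
    "containers": [
        {"name": "api", "resources": {"requests": {"cpu": "250m"}}},
        {"name": "sidecar"},
    ]
}

for problem in missing_resources(pod_spec):
    print(problem)
```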

HPA is useful when the signal is well chosen. Scaling only by CPU is a shortcut that fails systematically in I/O-bound workloads, services waiting on slow dependency responses, or systems that accumulate queue pressure before CPU consumption reflects it. In those cases, KEDA or custom metrics based on queue depth, consumer lag, or active connections allow you to scale by actual system demand rather than an imperfect proxy.
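The effect of the chosen signal is easy to see with the replica calculation that the Kubernetes documentation describes for HPA: the current replica count multiplied by the ratio of current to target metric value, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The sketch below is a simplified version of that rule, and the workload numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(replicas * current / target), skipped near 1.0."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Same I/O-bound service, same moment, two different signals.
by_cpu = desired_replicas(4, current_value=40, target_value=70)      # CPU utilisation %
by_queue = desired_replicas(4, current_value=250, target_value=100)  # messages per pod

print(f"CPU says {by_cpu} replicas, queue depth says {by_queue}")
```

In this example CPU suggests scaling in while the queue is backing up, which is the failure mode described above.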

On AWS, the choice between EKS Auto Mode and Karpenter defines the operational model for node scaling. Auto Mode reduces operational burden by ceding control to the managed service. Karpenter gives granular control over instance types, availability zones, aggressive consolidation, and Spot interruption policies. The right choice depends on how much you want to operate, not on which has better marketing. The article EKS Auto Mode vs Karpenter covers that decision in depth.

VPA complements HPA for workloads where horizontal scaling is not the answer: batch jobs, services with memory that grows based on data loaded at startup, or applications the team cannot or does not want to replicate. VPA in Off mode with visible recommendations is a sensible starting point before enabling automatic adjustment, which requires accepting planned restarts.
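A recommendation-only VPA looks roughly like the manifest below, expressed as a Python dict and dumped as JSON so it can be applied directly. It assumes the VPA components are installed in the cluster, since they are not part of core Kubernetes. The workload name and namespace are hypothetical.

```python
import json

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "report-worker", "namespace": "batch"},
    "spec": {
        "targetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "report-worker",
        },
        # "Off" computes recommendations but never evicts or resizes pods.
        "updatePolicy": {"updateMode": "Off"},
    },
}

print(json.dumps(vpa, indent=2))
```

With this mode, the recommendations appear in the object's status, where they can be compared against the declared requests before anyone enables automatic updates.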

Observability in production

The observability that matters is not the kind that records everything: it is the kind that enables decisions under pressure. During on-call, the difference between well-designed and poorly designed observability is not how many dashboards exist, but how long it takes to go from "there is a problem" to "I know what is failing and why."

OpenTelemetry is the de facto standard for instrumentation because it decouples signal collection from the backend where you analyze it. You instrument once and can switch from Jaeger to Tempo, from Prometheus to Mimir, without rewriting application code. In organizations that are growing or evaluating a change of observability provider, that decoupling has real economic value.

The most common stack that balances capability and operational cost in self-managed clusters is Prometheus for metrics, Loki for logs, and Tempo or Jaeger for traces, all surfaced through Grafana. In managed environments, services like Amazon Managed Service for Prometheus or Google Cloud Managed Service for Prometheus eliminate the operational burden of storage and retention in exchange for direct cost.

The maturity step that most reduces MTTR is moving from threshold-based alerts to error budget burn rate alerts. Threshold alerts generate noise because they do not distinguish between "something changed" and "something matters." A burn rate alert on an SLO links the signal to user-facing impact rather than to absolute values that age poorly. That change is not cosmetic: it reduces on-call fatigue and makes alerts actionable again.
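The arithmetic behind a burn rate alert is short. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). A burn rate of 1 spends the budget exactly over the SLO period, and higher values spend it proportionally faster. The sketch below pages only when both a long and a short window exceed the threshold, so an issue that has already recovered does not wake anyone. The 14.4 threshold, the 30-day period, and the window sizes are assumptions for illustration, not values from this guide.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 720) -> float:
    """Fraction of the whole budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / period_hours


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn above the threshold."""
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


SLO = 0.999  # hypothetical availability target over 30 days (720 hours)

print(should_page(0.02, 0.03, SLO))    # still burning in both windows
print(should_page(0.02, 0.0005, SLO))  # long window is high, but it has recovered
print(budget_consumed(14.4, window_hours=1))
```

At a 14.4 burn rate, one hour consumes about 2% of a 30-day budget, which is why that threshold is tied to impact rather than to an absolute error count.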

Structured JSON logs, correlated by trace ID or request ID, are the complement that makes investigations traceable. When a problem requires jumping from metric to trace to log, correlation by a shared identifier is what allows you to follow the thread in seconds rather than minutes.
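A minimal way to get correlated JSON logs with the Python standard library is a custom formatter that emits one JSON object per line and carries the trace ID passed with each log call. The field names and the trace ID value are hypothetical. In an instrumented service the ID would come from the active trace context rather than a literal.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line that includes a trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives through the `extra` argument of the log call.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("payment authorised", extra={"trace_id": "trace-0001"})
print(stream.getvalue(), end="")
```

Once every line carries the same identifier as the trace, the log backend can filter on it directly instead of relying on timestamps.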

Networking and Ingress

The ingress controller is the first point of contact for external traffic and therefore one of the highest-impact points for availability and security. In 2026, the ingress ecosystem is in active transition. Ingress NGINX, which was for years the implicit default controller for many clusters, entered a retirement process as an independent project. That forces many teams to make a decision they had been deferring.

The main alternatives are Ingress NGINX maintained by the Kubernetes community under new governance, Gateway API as the formal successor with more expressiveness and better separation of responsibilities, and vendor-backed controllers like AWS Load Balancer Controller, NGINX Gateway Fabric, or Traefik. The Ingress NGINX migration guide covers that transition with concrete criteria for teams that need to decide without letting urgency dictate architecture.

Gateway API deserves specific attention because it solves problems that Ingress could never model well: header-based routing, shared routes with separate ownership between teams, retry and timeout policies expressed as first-class resources, and clean integration with service mesh. The jump from Ingress to Gateway API is not trivial, but in organizations with multiple teams sharing the same entry point, the Gateway API role model simplifies governance in a meaningful way.
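As an illustration of header-based routing with separate ownership, the sketch below defines an HTTPRoute owned by an application team and attached to a Gateway owned by a platform team in another namespace. It is written as a Python dict and dumped as JSON. All names, the hostname, and the header are hypothetical, and the Gateway must be configured to allow routes from the application namespace.

```python
import json

http_route = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {"name": "checkout", "namespace": "checkout"},
    "spec": {
        # The Gateway lives in the platform team's namespace.
        "parentRefs": [{"name": "shared-gateway", "namespace": "platform"}],
        "hostnames": ["shop.example.com"],
        "rules": [
            {
                # Requests carrying this header go to the canary backend.
                "matches": [
                    {"headers": [{"type": "Exact", "name": "x-canary", "value": "true"}]}
                ],
                "backendRefs": [{"name": "checkout-canary", "port": 8080}],
            },
            {
                # A rule without matches catches every remaining request.
                "backendRefs": [{"name": "checkout", "port": 8080}],
            },
        ],
    },
}

print(json.dumps(http_route, indent=2))
```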

Service mesh is the natural extension of internal networking when security, observability, or traffic control needs exceed what NetworkPolicies and ingress-level configuration can handle. Istio and Linkerd are the most mature options. The decision to adopt a service mesh should be driven by verified concrete needs, not by the aspiration to have mTLS "because it's good." The operational complexity that a poorly governed mesh adds typically costs more than the risk it mitigates.

Migration to Kubernetes and GitOps

Migrating to Kubernetes without a clear delivery model is moving the problem, not solving it. In many migration projects, the first result is a technically more sophisticated platform running the same manual processes as before, now expressed in YAML rather than bash scripts. The debt changes shape but does not disappear.

GitOps solves the problem at the root because it turns the desired state of the cluster into code that is versioned, reviewed, and continuously reconciled. With ArgoCD as the control loop, the cluster automatically converges to the state Git says it should have. If someone makes a manual change, drift is visible. If a deployment fails, rollback is a git revert. If you need to audit what changed and when, the Git history is the source of truth.
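The idea of visible drift can be shown with a conceptual comparison between the state declared in Git and the state observed in the cluster. This is not how ArgoCD is implemented: real tools normalise manifests and ignore fields the API server fills in. The function name and sample states are hypothetical.

```python
def diff_state(desired: dict, live: dict, path: str = "") -> list[str]:
    """Recursively list the fields where the live state differs from the desired one."""
    drift = []
    for key in sorted(set(desired) | set(live)):
        here = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(f"{here}: missing in cluster")
        elif key not in desired:
            drift.append(f"{here}: not declared in Git")
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            drift.extend(diff_state(desired[key], live[key], here))
        elif desired[key] != live[key]:
            drift.append(f"{here}: Git={desired[key]!r} cluster={live[key]!r}")
    return drift


in_git = {"spec": {"replicas": 3, "image": "shop/api:1.4.2"}}
in_cluster = {"spec": {"replicas": 5, "image": "shop/api:1.4.2"}}  # someone scaled by hand

for line in diff_state(in_git, in_cluster):
    print(line)
```

A reconciliation loop runs a comparison like this continuously and either reports the difference or reverts it, depending on how sync is configured.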

The GitOps with ArgoCD guide covers implementation with Helm, per-environment value separation, and the ApplicationSets model for managing multiple clusters or environments without duplicating configuration.

The migration itself, especially when systems cannot afford downtime, requires a phased transition pattern: replicate, verify, gradually redirect traffic, and only then decommission the source system. The article Seamless cloud migration covers that pattern with go/no-go criteria for each phase.
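The phased pattern above can be sketched as a list of phases and a go/no-go decision between them. The traffic percentages, the error threshold, and all names are assumptions for illustration. Real criteria would usually include latency and data consistency checks as well.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    traffic_to_target: int  # percentage of traffic sent to the new platform


PHASES = [
    Phase("replicate", 0),
    Phase("verify", 0),
    Phase("redirect-10", 10),
    Phase("redirect-50", 50),
    Phase("redirect-100", 100),
    Phase("decommission-source", 100),
]


def next_step(current_index: int, error_ratio: float,
              max_error_ratio: float = 0.01) -> tuple[str, Phase]:
    """Advance when the observed error ratio is acceptable, otherwise step back."""
    if error_ratio > max_error_ratio:
        # No-go: return to the previous phase and its traffic split.
        return "rollback", PHASES[max(current_index - 1, 0)]
    if current_index + 1 < len(PHASES):
        return "advance", PHASES[current_index + 1]
    return "done", PHASES[current_index]


print(next_step(2, error_ratio=0.002))  # healthy at 10%: move to 50%
print(next_step(3, error_ratio=0.05))   # unhealthy at 50%: back to 10%
```

Rollback stays possible only while the source system exists, which is why decommissioning is the last phase and not part of the traffic shift.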

Kubernetes audit checklist

A periodic cluster review does not need to be a heavyweight process, but it does need to cover the areas where silent drift accumulates risk. The following table serves as a minimum guide for teams that want to keep the cluster in a governable state without turning the audit into a project.

Area	Audit question	Risk signal
RBAC and permissions	Do all service accounts have documented minimum necessary privileges?	ClusterRoles with `*` in verbs or resources without justification
Network Policies	Is there a default deny policy for all namespaces?	Namespaces with no NetworkPolicy applied
Pod Security	Do all workloads apply a security profile and are exceptions justified?	Pods running as root without documented need
Declared resources	Do all pods have `requests` and `limits` based on measured real consumption?	Pods without `requests` or with CPU `limits` too low causing throttling
Secrets and credentials	Are dynamic identity mechanisms used instead of static credentials in environment variables?	Long-lived secrets hardcoded in manifests or configmaps
Images and supply chain	Does CI block deployments with critical CVEs without a known patch?	Scans that detect but do not block
Ingress and exposure	Does the ingress controller have rate limiting, and does it cover every public endpoint?	Public endpoints with no rate limit
Kubernetes version	Is the cluster version within official support and is the next upgrade planned?	Version less than 60 days from end of support
Add-ons and operators	Are all additional components on versions compatible with the cluster version?	Add-ons on versions not listed as compatible
Observability	Are SLOs defined with burn rate alerts and associated runbooks?	Alerts only by threshold with no runbook or owner
Backups and recovery	Are critical state backups tested with full restoration and measured RTO?	Backups configured but without verified restoration in the last 90 days
Scaling	Is the autoscaler configured with signals that represent actual system pressure?	HPA by CPU on I/O-bound services with no queue or latency metrics

This table is not exhaustive, but it covers the points where we consistently find accumulated debt in clusters that "work fine" until they don't.
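Two of the risk signals in the table are date comparisons and are easy to automate: a cluster version less than 60 days from end of support, and backups without a verified restoration in the last 90 days. The sketch below encodes both. All dates are hypothetical inputs, not real end-of-support dates for any Kubernetes release.

```python
from datetime import date


def version_at_risk(end_of_support: date, today: date, warning_days: int = 60) -> bool:
    """True when the cluster version is fewer than `warning_days` from end of support."""
    return (end_of_support - today).days < warning_days


def restore_is_stale(last_verified_restore: date, today: date,
                     max_age_days: int = 90) -> bool:
    """True when the last verified restoration is older than `max_age_days`."""
    return (today - last_verified_restore).days > max_age_days


today = date(2026, 3, 1)  # hypothetical review date

print(version_at_risk(end_of_support=date(2026, 4, 15), today=today))
print(restore_is_stale(last_verified_restore=date(2025, 11, 1), today=today))
```

Running checks like these on a schedule turns two rows of the checklist into alerts that fire before the deadline rather than after it.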

When to act

If reading this guide makes you recognize more than three checklist items as areas where your cluster has unresolved debt, the time to act is before an incident forces it. The difference between reviewing with time and reviewing under pressure is not just operational stress: it is the difference between choosing the right solution and choosing the fast solution, which are frequently different.

An external team with production Kubernetes experience can do in days what would take weeks internally: map the actual state of the cluster, prioritize debt by impact and risk, and define an action plan the team can execute without stopping product work.

If that is your situation, the cloud and DevOps consulting page describes how we work and what kinds of situations we handle.