Kubernetes: A Deep Reference for People Who Want to Actually Understand It

I’ve been working through a Kubernetes book recently and the thing that struck me is how much of k8s makes sense once you understand the why behind each piece. A lot of tutorials just teach you to run kubectl apply and call it a day. This post is my attempt to write the reference I wish I had, covering the mental models and the underlying mechanics, not just the commands.

This is long. Get a drink.

1. What is Kubernetes and Why It Exists

The World Before Kubernetes

Imagine you have a Python service packaged as a Docker container. You SSH into a server and run it. Life is good. Then you need three copies of it for redundancy. You SSH into three servers, run the container on each. Still manageable.

Now your service crashes on server two at 3am. Nobody restarts it. Now you have two copies. Traffic spikes, you need ten copies. You SSH into seven more servers. You ship a new image version, so you need to update all ten servers, one at a time, without taking the service down. One server runs out of disk. Another has a memory leak from a different service eating into what your container needs.

This is the chaos that existed before container orchestration. People wrote custom bash scripts, used tools like Fabric or Ansible to coordinate deployments, and had on-call rotations specifically to babysit crashed containers. It was operational misery at scale.

What Kubernetes Actually Solves

Kubernetes is a container orchestrator. But more precisely, it is a system that implements a declarative desired-state model with automatic reconciliation.

That sentence is the most important thing to understand about k8s. Let me break it down.

Declarative: You don’t tell k8s how to do something. You tell it what you want the world to look like. “I want three copies of this container running.” Not “spin up container on server A, then server B, then server C.”

Desired state: Your declarations are stored as the desired state of the cluster. k8s holds this in a database (etcd) and constantly works toward making reality match it.

Automatic reconciliation: k8s runs continuous loops that compare the actual state of the cluster against the desired state. When they diverge, k8s takes action to bring them back in line. A pod crashes? k8s notices the actual count (2) doesn’t match desired count (3) and schedules a new one.

The Mental Model

Think of Kubernetes like a thermostat, not like a remote control.

A remote control is imperative: you press the button, the TV changes channel. If it doesn’t change, you press again. You are the feedback loop.

A thermostat is declarative: you set 72 degrees. The thermostat measures the actual temperature, compares it to 72, and runs the heater or AC until they match. You just state what you want.

With k8s, you write YAML that describes your desired world. You hand it to k8s. k8s figures out the sequence of steps needed to make reality match your description, executes them, and keeps checking forever. If reality drifts away from your description (node dies, container crashes, someone manually kills a pod), k8s corrects it.

This is a fundamentally different operational model. The system is self-healing by design, not by accident.

2. Architecture: The Control Plane

Kubernetes has two kinds of nodes: control plane nodes (the brain) and worker nodes (where your workloads actually run). Let’s go through the control plane first.

Control Plane
+----------------------------------------------------------+
|                                                          |
|   kube-apiserver  <-->  etcd                             |
|         ^                                                |
|         |                                                |
|   kube-scheduler                                         |
|   kube-controller-manager                                |
|   cloud-controller-manager                               |
|                                                          |
+----------------------------------------------------------+
          |
          | (watches API server)
          v
+------------------+     +------------------+
|   Worker Node 1  |     |   Worker Node 2  |
|   kubelet        |     |   kubelet        |
|   kube-proxy     |     |   kube-proxy     |
|   containerd     |     |   containerd     |
+------------------+     +------------------+

kube-apiserver: The Front Door

The API server is the single entry point for all communication in a Kubernetes cluster. Everything, including the scheduler, controller-manager, kubelet, and kubectl, talks to the cluster by making REST calls to the API server. Nothing talks to etcd directly except the API server.

When you run kubectl apply -f deployment.yaml, kubectl reads your kubeconfig, finds the server address, and makes an HTTP POST or PATCH request to the API server. The API server authenticates you (is this a valid token?), authorizes you (does this user have permission to create deployments?), runs admission controllers (are there any policies that need to modify or reject this request?), validates the object schema, and finally writes it to etcd.

The API server also supports watch semantics. A client can say “give me all pods, and then keep sending me events as pods are created/updated/deleted.” This is how the scheduler knows when a new pod without a node assignment shows up. This is how the kubelet knows when a pod has been scheduled to its node. The entire coordination model is built on top of this watch mechanism.
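The list-then-watch pattern is easy to simulate. Here is a toy in-memory sketch of the idea (ToyAPIServer and its methods are made up for illustration, not the real API machinery):

```python
import queue

# Toy stand-in for the API server's list+watch mechanism. A client first
# lists the current state, then receives every subsequent event on a queue.
class ToyAPIServer:
    def __init__(self):
        self.objects = {}
        self.watch_queues = []

    def create(self, name, obj):
        self.objects[name] = obj
        for q in self.watch_queues:
            q.put(("ADDED", name, obj))

    def list_and_watch(self):
        snapshot = dict(self.objects)   # "give me all pods..."
        q = queue.Queue()
        self.watch_queues.append(q)     # "...then keep sending me events"
        return snapshot, q

api = ToyAPIServer()
api.create("pod-a", {"phase": "Pending"})
snapshot, events = api.list_and_watch()
api.create("pod-b", {"phase": "Pending"})
print(sorted(snapshot))          # ['pod-a']
print(events.get_nowait()[:2])   # ('ADDED', 'pod-b')
```

The scheduler, controllers, and kubelets all sit on the "client" end of exactly this kind of stream (with extra machinery like resourceVersion bookkeeping for resuming broken watches).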

etcd: The Source of Truth

etcd is a distributed key-value store that Kubernetes uses as its database. Every object you create, every pod, every deployment, every config, lives in etcd. It is the control plane’s only persistent store.

etcd uses the Raft consensus algorithm, which means it needs a quorum (majority) of its members to agree on writes. In production, you run etcd as a cluster of 3 or 5 nodes for high availability. If etcd goes down, the cluster doesn’t immediately break (existing pods keep running), but you can’t make any changes, nothing can be scheduled, and no reconciliation happens.
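The quorum arithmetic also explains why member counts are odd: a fourth member raises the quorum without raising fault tolerance. A quick sketch:

```python
# Raft needs a majority of members to agree on each write.
def quorum(members):
    return members // 2 + 1

def tolerated_failures(members):
    return members - quorum(members)   # equivalently (members - 1) // 2

for n in (1, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 and 4 members both tolerate exactly 1 failure, so the 4th buys you nothing.
```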

Understanding etcd matters because it explains why etcd backups are critical. If you lose etcd without a backup, you lose all knowledge of what is supposed to be running. The worker nodes still have containers running, but the control plane has no memory of them.

kube-scheduler: Finding a Home for Pods

The scheduler watches for pods that have been created but not yet assigned to a node (these are called “pending” pods). When it finds one, it runs a two-phase process to pick a node.

Filter phase (predicates): The scheduler eliminates nodes that can’t run the pod. Reasons include insufficient CPU or memory, node taints that the pod doesn’t tolerate, node selectors that don’t match, pod anti-affinity rules, etc.

Score phase (priorities): From the remaining nodes, the scheduler scores each one. It considers things like spreading pods across failure domains, preferring nodes with the container image already cached, balancing resource utilization, etc. The node with the highest score wins.

Once a node is selected, the scheduler posts a Binding through the API server, which records the node name in the pod’s spec (the API server, in turn, persists it to etcd). The kubelet on that node sees this (via a watch) and starts running the pod.

The scheduler doesn’t start the container. It just assigns the pod to a node. Execution is the kubelet’s job.
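The two phases can be sketched in a few lines. The resource numbers and the single scoring heuristic below are made up; the real scheduler combines many weighted scoring plugins:

```python
# Toy sketch of the scheduler's filter-then-score pipeline.
def schedule(pod, nodes):
    # Filter phase: drop nodes that can't satisfy the pod's requests.
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]]
    if not feasible:
        return None  # the pod stays Pending
    # Score phase: here, prefer the node with the most memory left after placement.
    return max(feasible, key=lambda n: n["free_mem"] - pod["mem"])["name"]

nodes = [
    {"name": "node-a", "free_cpu": 2.0, "free_mem": 4096},
    {"name": "node-b", "free_cpu": 0.1, "free_mem": 8192},  # filtered: not enough CPU
    {"name": "node-c", "free_cpu": 4.0, "free_mem": 2048},
]
print(schedule({"cpu": 0.5, "mem": 1024}, nodes))  # node-a
```

Note the None return: if the filter phase eliminates every node, the pod simply stays Pending, which is exactly what you see in kubectl describe pod as "0/3 nodes are available" events.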

kube-controller-manager: The Reconciliation Engine

The controller manager is a single binary that runs dozens of controllers. Each controller watches specific resources and reconciles them.

The Deployment controller watches Deployments and ReplicaSets. When you create a Deployment, the Deployment controller creates a ReplicaSet. When you update the Deployment (new image version), the Deployment controller creates a new ReplicaSet, scales it up, and scales the old one down, implementing the rolling update.

The ReplicaSet controller watches ReplicaSets and Pods. Its job is simple: make sure the number of running pods matches the replicas field. If a pod dies, it creates a new one. If there are too many pods, it deletes some.

The Node controller watches Node objects and marks them as not-ready when they stop sending heartbeats. It evicts pods from nodes that have been unreachable too long.

There are controllers for services, endpoints, namespaces, persistent volumes, and many more. They all follow the same pattern: watch, diff, act.

cloud-controller-manager

When k8s runs on a cloud provider (AWS, GCP, Azure), the cloud-controller-manager handles integrations with cloud APIs. When you create a Service of type LoadBalancer, the cloud controller is what actually calls the AWS API to create a Network Load Balancer. When a node joins the cluster, the cloud controller retrieves the node’s instance ID, region, and availability zone from the cloud metadata API and labels the node with this information.

This component is why Kubernetes is cloud-agnostic: the core of k8s doesn’t know about AWS or GCP. That knowledge lives in the cloud-controller-manager, which is provided by the cloud vendor.

3. Architecture: Worker Nodes

kubelet: The Node Agent

Every worker node runs a kubelet. It is the direct interface between k8s and the container runtime on that node.

The kubelet watches the API server for pods that have been scheduled to its node. When it sees one, it instructs the container runtime (containerd) to pull the image and start the container. It then monitors the container and reports status back to the API server.

The kubelet also runs liveness and readiness probes. If a liveness probe fails, the kubelet restarts the container. It manages volume mounts, secrets, and configmaps for pods. It reports node resource usage. It is the busiest component on a worker node.

The kubelet is also the component that enforces resource requests and limits. It works with the Linux kernel’s cgroups to constrain how much CPU and memory each container can use.

kube-proxy: The Network Rules Manager

kube-proxy runs on every node and manages the network rules that implement Kubernetes Services. When you create a Service, kube-proxy watches for it and programs the node’s network stack accordingly.

In iptables mode (the default on most clusters), kube-proxy writes iptables rules that intercept traffic destined for a Service’s ClusterIP and randomly forward it to one of the backing pods. When a pod is added or removed from a Service, kube-proxy updates the iptables rules.

In IPVS mode (more performant at scale), kube-proxy uses Linux IPVS (IP Virtual Server) which is purpose-built for load balancing and handles large numbers of rules more efficiently.

Note: kube-proxy does not actually proxy the traffic in userspace anymore in modern clusters. It just programs kernel-level rules. The actual packets are forwarded by the kernel, not by the kube-proxy process.

Container Runtime

The container runtime is what actually runs containers. Docker used to be the default, but Kubernetes removed its Docker-specific shim (dockershim) in v1.24 and now talks to runtimes only through the CRI.

Modern clusters use containerd (a graduated CNCF project) or CRI-O. Both implement the Container Runtime Interface (CRI), which is the standard API the kubelet uses to talk to the runtime. This abstraction is why you can swap runtimes without changing the rest of k8s.

containerd handles pulling images, managing the lifecycle of containers, and exposing the CRI gRPC endpoint that kubelet calls. Under the hood, containerd uses runc (the OCI runtime) to actually create the container, which makes system calls to Linux namespaces and cgroups.

4. Core Objects

Pod

The pod is the smallest deployable unit in Kubernetes. A pod is a group of one or more containers that share a network namespace and a set of storage volumes.

Sharing a network namespace means all containers in a pod share the same IP address and port space. They can communicate with each other via localhost. This is why you put a web server and its sidecar proxy (like Envoy) in the same pod: the proxy can intercept the server’s traffic without it ever leaving the pod.

Pods are ephemeral. They are not rescheduled when they die. They are not migrated when a node goes down. A pod that dies stays dead. This is intentional: higher-level controllers (Deployments, StatefulSets) are responsible for creating new pods when old ones die.

Never run a pod directly in production. Always use a controller.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: example
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80

ReplicaSet

A ReplicaSet ensures that a specified number of pod replicas are running at any time. It uses a label selector to identify which pods it owns.

You almost never create ReplicaSets directly. Deployments manage ReplicaSets for you. Understanding ReplicaSets is still important because they’re what actually implements the “N copies” guarantee.

If you delete a pod that’s owned by a ReplicaSet, the ReplicaSet creates a replacement. If you create a new pod with labels that match a ReplicaSet’s selector, the ReplicaSet may delete it to stay at the desired count. Labels are everything.

Deployment

A Deployment is the standard way to run stateless applications. It manages a ReplicaSet and adds the ability to roll out updates and roll them back.

When you update a Deployment (change the image, for example), the Deployment controller creates a new ReplicaSet with the new configuration, scales it up, and scales the old ReplicaSet down. Old ReplicaSets are kept around (but scaled to 0) so you can roll back to them.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: myapp:2.1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 15
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 3
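The strategy fields in the manifest above bound the rollout: maxSurge caps how far above the replica count the rollout may go, and maxUnavailable caps how far below it may dip. The arithmetic, spelled out:

```python
# Bounds a RollingUpdate must respect, using the strategy values from the manifest above.
replicas, max_surge, max_unavailable = 3, 1, 1

max_total_pods = replicas + max_surge        # never more than 4 pods at once
min_ready_pods = replicas - max_unavailable  # never fewer than 2 ready pods

print(max_total_pods, min_ready_pods)  # 4 2
```

So at every moment of the rollout there are between 2 and 4 pods, which is why you can serve traffic continuously through an update.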

StatefulSet

StatefulSets are for workloads that need persistent identity or stable storage. Databases, message queues, distributed systems.

The key differences from a Deployment:

  • Pods have stable, predictable names: app-0, app-1, app-2.
  • Pods are created and deleted in order (app-0 before app-1, etc.).
  • Each pod gets its own PersistentVolumeClaim, and that PVC follows the pod if it’s rescheduled.
  • The pod’s network identity (DNS hostname) is stable across restarts.

If you run a three-node MySQL cluster in k8s, you need to know which pod is the primary. With a StatefulSet, pod 0 is always pod 0. You can hardcode the peer discovery logic around mysql-0.mysql.default.svc.cluster.local.

DaemonSet

A DaemonSet ensures that every node (or a subset of nodes) runs a copy of a pod. As new nodes join the cluster, a pod is automatically added to them. When nodes are removed, the pods are garbage collected.

Use cases: log collectors (Fluentd, Filebeat), monitoring agents (Prometheus node exporter), network plugins, storage agents. Anything that needs to run on every node.

Job and CronJob

A Job creates one or more pods and tracks their completion. When a pod completes successfully, the Job is done. If the pod fails, the Job retries it up to a specified number of times. Use Jobs for batch workloads: data processing, database migrations, report generation.

A CronJob creates Jobs on a schedule, using the familiar cron expression syntax. Use CronJobs for periodic tasks: nightly cleanup, hourly data exports, cache warming.

Namespace

A Namespace is a virtual cluster inside a physical cluster. It provides a scope for names: a Deployment named web-app in namespace staging is different from one in namespace production. Resource quotas and RBAC policies can be scoped to a namespace.

Namespaces don’t provide network isolation by default. Pods in different namespaces can still talk to each other unless you implement NetworkPolicies. Think of namespaces as organizational units, not security boundaries (unless you add network policies).

5. Networking

Kubernetes networking is built on a few core rules:

  • Every pod gets its own unique IP address.
  • Pods can communicate with any other pod in the cluster without NAT.
  • Nodes can communicate with any pod without NAT.

These rules are implemented by the CNI plugin, not by Kubernetes itself. Let’s go through how it all works.

Services

A pod’s IP address changes every time the pod is recreated. You can’t hardcode pod IPs. Services solve this: a Service is a stable virtual IP that load-balances to a set of pods selected by labels.

ClusterIP: The default type. Creates a virtual IP that’s only reachable from inside the cluster. Traffic to the ClusterIP is forwarded to one of the backing pods.

apiVersion: v1
kind: Service
metadata:
  name: web-app-svc
  namespace: production
spec:
  selector:
    app: web-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: ClusterIP

NodePort: Exposes the service on a static port on every node’s IP. Traffic to NODE_IP:NODE_PORT is forwarded to the service. Useful for development or when you don’t have a load balancer. Port range is 30000-32767.

LoadBalancer: Tells the cloud provider to provision an external load balancer that routes traffic into the cluster. This is how you expose services to the internet in production on cloud clusters.

ExternalName: A DNS alias. Returns a CNAME record pointing to an external DNS name. Useful for integrating external services into k8s service discovery.

How kube-proxy Implements Services with iptables

When you create a Service, the Endpoints controller watches for pods matching the selector and creates an Endpoints object (EndpointSlices in modern clusters) listing their IPs and ports. kube-proxy watches Services and Endpoints and programs iptables rules.

For a ClusterIP service at 10.96.0.100:80, kube-proxy writes iptables rules roughly like:

PREROUTING -> KUBE-SERVICES -> KUBE-SVC-XXXX -> KUBE-SEP-YYY1 (probability 1/3)         -> DNAT to pod1:8080
                                             -> KUBE-SEP-YYY2 (probability 1/2 of rest) -> DNAT to pod2:8080
                                             -> KUBE-SEP-YYY3 (otherwise)               -> DNAT to pod3:8080

The kernel evaluates these rules for every packet. When a packet’s destination is the ClusterIP, it gets DNAT’d (Destination NAT) to one of the actual pod IPs. The response packets are reverse NAT’d. The application has no idea any of this is happening.

This iptables approach works but has scaling problems: with thousands of services, the iptables ruleset grows massive and updating it takes time. IPVS mode solves this with a dedicated kernel load balancer.
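Those chained probabilities (1/3, then 1/2 of the remainder, then the rest) work out to a uniform pick across the endpoints. A sketch of that selection logic (a hypothetical helper, not the real iptables code):

```python
import random
from collections import Counter

def pick_endpoint(endpoints, rng=random):
    # iptables gives rule i probability 1/(n-i): 1/3, then 1/2, then 1.
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if rng.random() < 1.0 / (n - i):
            return ep
    return endpoints[-1]  # unreachable: the last rule always matches

random.seed(0)
counts = Counter(pick_endpoint(["pod1", "pod2", "pod3"]) for _ in range(30000))
print(counts)  # roughly 10000 each: the chained probabilities are uniform overall
```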

Ingress

Services expose your app inside the cluster or as a raw TCP/UDP load balancer. Ingress is for HTTP/HTTPS routing.

An Ingress resource defines routing rules: “requests to api.example.com/v1 go to the api-service, requests to app.example.com go to the web-service.” An Ingress Controller (nginx, HAProxy, Traefik, AWS ALB) implements these rules.

Ingress handles TLS termination too. You put your TLS certificate in a Secret and reference it from the Ingress. The Ingress controller decrypts the HTTPS traffic and forwards plain HTTP to your pods.

DNS and Service Discovery

CoreDNS runs in the cluster and provides DNS resolution. Every Service gets a DNS name:

<service-name>.<namespace>.svc.cluster.local

So a Service named postgres in namespace production is reachable at postgres.production.svc.cluster.local. From a pod in the same namespace, you can just use postgres. The DNS search domains are configured automatically in each pod’s /etc/resolv.conf.

Pods also get DNS names in the format pod-ip-dashes.namespace.pod.cluster.local, though these are rarely used directly.
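The service naming scheme is mechanical enough to express as a one-liner (assuming the default cluster.local domain):

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    # <service-name>.<namespace>.svc.<cluster-domain>
    return f"{service}.{namespace}.svc.{cluster_domain}"

print(service_fqdn("postgres", "production"))
# postgres.production.svc.cluster.local
```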

CNI Plugins

The Container Network Interface (CNI) is the standard that container runtimes use to configure networking for pods. When a pod is created, the kubelet calls the CNI plugin to set up the pod’s network interface, assign it an IP, and configure routing.

Flannel: Simple overlay network. Uses VXLAN to encapsulate pod-to-pod traffic. Easy to set up, but feature-light: no NetworkPolicy support, for example.

Calico: Uses BGP to route pod traffic without encapsulation (in its default mode). Also provides robust NetworkPolicy implementation. Good performance.

Cilium: Uses eBPF (Extended Berkeley Packet Filter) to implement networking entirely in the kernel, bypassing iptables entirely. Best performance, deep observability, and security features. Increasingly the default choice for new clusters.

6. Storage

Volumes vs PersistentVolumes

A Volume in Kubernetes is a directory accessible to containers in a pod. It solves the basic problem of data disappearing when a container restarts: the volume outlives the container. But a Volume’s lifetime is tied to the pod. When the pod is deleted, the volume is gone.

A PersistentVolume (PV) is a piece of storage in the cluster that has a lifecycle independent of any pod. It’s a cluster-level resource, like a node. It could be an NFS share, an AWS EBS volume, a GCP Persistent Disk, etc.

PersistentVolumeClaims

A PersistentVolumeClaim (PVC) is a request for storage by a user. It’s like a pod claiming CPU and memory from a node, but for storage. You specify the size, access mode (ReadWriteOnce, ReadWriteMany, ReadOnlyMany), and optionally a StorageClass.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: production
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: fast-ssd

You reference the PVC in your pod spec:

spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: postgres-data
  containers:
  - name: postgres
    image: postgres:16
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data

StorageClasses and Dynamic Provisioning

Before StorageClasses, you had to manually create PersistentVolumes before a PVC could be bound to them. This was tedious.

StorageClasses enable dynamic provisioning: when a PVC is created that references a StorageClass, the class’s provisioner automatically creates a PV for it. On AWS, the gp2 StorageClass provisioner calls the AWS API to create an EBS volume. On GCP, it creates a Persistent Disk.

You define what kind of storage you want (fast SSD, cheap HDD, etc.) as named StorageClasses, and users just claim by class name. The underlying infrastructure detail is abstracted away.

ConfigMaps and Secrets

ConfigMaps store non-sensitive configuration data as key-value pairs. You can consume them in three ways: as environment variables, as command-line arguments, or as files mounted in a volume.

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  DATABASE_HOST: "postgres.production.svc.cluster.local"
  DATABASE_PORT: "5432"
  LOG_LEVEL: "info"
  config.yaml: |
    server:
      port: 8080
      timeout: 30s
    feature_flags:
      new_ui: false

Using a ConfigMap as environment variables:

spec:
  containers:
  - name: app
    image: myapp:latest
    envFrom:
    - configMapRef:
        name: app-config

Or selectively:

    env:
    - name: DB_HOST
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: DATABASE_HOST

Secrets work similarly but are meant for sensitive data (passwords, tokens, TLS certs). By default, Secrets are stored base64-encoded in etcd. Note that base64 is not encryption. In production, you should enable etcd encryption at rest or use an external secrets manager (HashiCorp Vault, AWS Secrets Manager) with something like the External Secrets Operator.
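To see how thin that base64 layer is, decoding a Secret value takes one call (the password below is made up):

```python
import base64

# What the Secret manifest stores...
encoded = base64.b64encode(b"s3cr3t-password").decode()
print(encoded)                             # czNjcjN0LXBhc3N3b3Jk
# ...and what anyone with read access to it can trivially recover:
print(base64.b64decode(encoded).decode())  # s3cr3t-password
```

This is exactly why RBAC on Secrets and encryption at rest matter: the encoding protects against nothing.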

7. kubectl: The CLI

How It Works

kubectl is a Go binary that reads your kubeconfig file (default ~/.kube/config), constructs REST API calls, and sends them to the kube-apiserver. It’s not magic: every kubectl command corresponds to an HTTP request.

kubectl get pods makes a GET to /api/v1/namespaces/default/pods. kubectl apply -f deployment.yaml makes a POST or PATCH to the appropriate resource endpoint.

You can see the actual HTTP calls with kubectl --v=8 get pods (verbosity level 8 shows full request/response).
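The command-to-path mapping for core (v1) namespaced resources is simple enough to sketch; this toy path builder is for illustration only, not part of kubectl:

```python
def core_v1_path(resource, namespace="default", name=None):
    # kubectl get pods                  -> GET /api/v1/namespaces/default/pods
    # kubectl get pod web -n production -> GET /api/v1/namespaces/production/pods/web
    path = f"/api/v1/namespaces/{namespace}/{resource}"
    return f"{path}/{name}" if name else path

print(core_v1_path("pods"))
print(core_v1_path("pods", namespace="production", name="web-app-abc123"))
```

(Resources outside the core group, like Deployments, live under /apis/&lt;group&gt;/&lt;version&gt;/... instead, e.g. /apis/apps/v1/.)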

kubeconfig

The kubeconfig file has three sections:

clusters: Defines API server endpoints and their CA certificates.

users: Defines credentials (tokens, client certificates, exec plugins).

contexts: Binds a cluster, a user, and a default namespace together.

apiVersion: v1
kind: Config
clusters:
- name: production
  cluster:
    server: https://prod-api.example.com:6443
    certificate-authority-data: <base64-encoded-ca-cert>
users:
- name: prod-admin
  user:
    token: <service-account-token>
contexts:
- name: prod-context
  context:
    cluster: production
    user: prod-admin
    namespace: default
current-context: prod-context

Switch contexts with kubectl config use-context CONTEXT_NAME. Install kubectx and kubens for faster context and namespace switching. They are essential quality-of-life tools.

Essential Commands

Get resources:

# List pods in current namespace
kubectl get pods

# List pods in all namespaces
kubectl get pods --all-namespaces

# List pods with more info (node, IP)
kubectl get pods -o wide

# Get a specific pod as YAML
kubectl get pod my-pod -o yaml

# Watch pods in real time
kubectl get pods --watch

# List deployments in a specific namespace
kubectl get deployments -n production

Describe gives you the full story of a resource, including events:

kubectl describe pod my-pod-abc123
kubectl describe node worker-node-1

The Events section in describe output is invaluable for debugging. If a pod is stuck in Pending, kubectl describe pod tells you why: no nodes match selector, insufficient resources, image pull failure, etc.

Logs:

# Get logs from a pod
kubectl logs my-pod

# Follow logs in real time
kubectl logs -f my-pod

# Get logs from a specific container in a multi-container pod
kubectl logs my-pod -c sidecar

# Get logs from the previous instance of a container (after a crash)
kubectl logs my-pod --previous

# Get last 100 lines
kubectl logs my-pod --tail=100

Exec into a running container:

# Get a shell
kubectl exec -it my-pod -- /bin/bash

# Run a single command
kubectl exec my-pod -- env

# In a multi-container pod
kubectl exec -it my-pod -c web -- /bin/sh

Apply and delete:

# Apply a manifest (create or update)
kubectl apply -f deployment.yaml

# Apply all manifests in a directory
kubectl apply -f ./k8s/

# Delete a resource
kubectl delete -f deployment.yaml

# Delete by name
kubectl delete pod my-pod

# Delete all pods matching a label
kubectl delete pods -l app=web-app

Rollout commands:

# Check rollout status
kubectl rollout status deployment/web-app

# See rollout history
kubectl rollout history deployment/web-app

# Rollback to previous version
kubectl rollout undo deployment/web-app

# Rollback to a specific revision
kubectl rollout undo deployment/web-app --to-revision=3

Scaling:

# Scale a deployment
kubectl scale deployment web-app --replicas=5

# Scale to zero (stop all pods without deleting the deployment)
kubectl scale deployment web-app --replicas=0

Port forwarding (great for development and debugging):

# Forward local port 8080 to pod port 80
kubectl port-forward pod/my-pod 8080:80

# Forward to a service (load-balanced across pods)
kubectl port-forward service/web-app-svc 8080:80

Resource usage (requires metrics-server addon):

kubectl top nodes
kubectl top pods
kubectl top pods -n production --sort-by=memory

Inline docs (this is underused and extremely useful):

# Explain a resource
kubectl explain deployment

# Explain a nested field
kubectl explain deployment.spec.strategy

# Explain pod resource limits
kubectl explain pod.spec.containers.resources

Dry run (validate without applying):

kubectl apply -f deployment.yaml --dry-run=client
kubectl apply -f deployment.yaml --dry-run=server

Server-side dry run actually sends the request to the API server and runs admission controllers, giving you a more accurate preview.

Labels and Selectors

Labels are key-value pairs attached to objects. Selectors filter objects by label. This is the glue that ties Kubernetes together.

A Service’s selector determines which pods receive traffic. A ReplicaSet’s selector determines which pods it manages. DaemonSet, StatefulSet, Job all use label selectors.
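matchLabels semantics are plain equality on every selector key; a minimal sketch:

```python
def matches(selector, labels):
    # matchLabels: every selector pair must be present, with the same value.
    return all(labels.get(k) == v for k, v in selector.items())

pods = [
    {"name": "web-1",  "labels": {"app": "web-app", "env": "production"}},
    {"name": "web-2",  "labels": {"app": "web-app", "env": "staging"}},
    {"name": "worker", "labels": {"app": "worker"}},
]
selector = {"app": "web-app", "env": "production"}
print([p["name"] for p in pods if matches(selector, p["labels"])])  # ['web-1']
```

This is the same logic, whether the consumer is a Service routing traffic or a ReplicaSet counting its pods.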

# Add a label to a pod
kubectl label pod my-pod env=staging

# Filter by label
kubectl get pods -l app=web-app

# Multiple label selectors
kubectl get pods -l app=web-app,env=production

# Set-based selector
kubectl get pods -l 'env in (staging, production)'

Annotations are similar to labels but are meant for non-identifying metadata: tool configurations, deployment timestamps, links to documentation. They’re not used for selecting objects.

Useful Flags Summary

-n <namespace>           # Target a specific namespace
--all-namespaces         # Operate across all namespaces
-o yaml                  # Output as YAML
-o json                  # Output as JSON
-o wide                  # Extra columns
-o jsonpath='{.status.podIP}'   # Extract specific fields
--watch                  # Stream updates
--dry-run=client         # Validate without applying
--dry-run=server         # Server-side validation
-l <selector>            # Filter by label selector
--field-selector         # Filter by field value
--force                  # Force delete (use carefully)
--grace-period=0         # Immediate deletion

8. Minikube

What It Is

Minikube runs a single-node Kubernetes cluster on your local machine. It’s the standard way to develop and test k8s locally without a cloud account.

The entire control plane (API server, scheduler, controller-manager, etcd) and a single worker node run together, either in a VM or a Docker container on your machine.

How It Works

Minikube uses a “driver” to provision the runtime environment:

docker driver: Runs the cluster inside a Docker container. Fast to start, no VM overhead. Default on Linux and most Mac setups.

virtualbox driver: Runs a VM using VirtualBox. More isolated from the host.

hyperkit driver: Mac-native hypervisor, faster than VirtualBox on macOS.

kvm2 driver: Linux KVM hypervisor.

Inside the VM or container, Minikube installs and runs all the k8s components. For the container runtime, it typically uses containerd.

Key Commands

# Start a cluster (latest stable k8s)
minikube start

# Start with a specific k8s version
minikube start --kubernetes-version=v1.29.0

# Start with more resources
minikube start --cpus=4 --memory=8192

# Use a specific driver
minikube start --driver=docker

# Check cluster status
minikube status

# Stop the cluster (preserves state)
minikube stop

# Delete the cluster entirely
minikube delete

# SSH into the minikube node
minikube ssh

# Open the Kubernetes dashboard
minikube dashboard

# Get the cluster IP
minikube ip

Addons

Minikube has an addon system for optional cluster components:

# List all addons
minikube addons list

# Enable metrics-server (for kubectl top)
minikube addons enable metrics-server

# Enable the Ingress controller
minikube addons enable ingress

# Enable the dashboard
minikube addons enable dashboard

Accessing Services

# Open a NodePort or LoadBalancer service in your browser
minikube service web-app-svc

# Get the URL without opening the browser
minikube service web-app-svc --url

# For LoadBalancer services, minikube tunnel creates a route
minikube tunnel

The minikube tunnel command is important. In a real cloud cluster, a LoadBalancer service gets a real external IP from the cloud provider. In Minikube, LoadBalancer services stay in “pending” state unless you run minikube tunnel, which sets up a network route from your machine into the cluster and assigns a local IP.

Limitations vs Real Clusters

Minikube is excellent for learning and development but has real limitations:

  • Single node: no multi-node scheduling, no actual node failure testing.
  • No real cloud integrations: LoadBalancer behavior is simulated.
  • Resource constrained: limited by your laptop’s RAM and CPU.
  • Performance: VM overhead means slower than a real cluster.
  • Not meant for production workloads.

For multi-node local testing, look at kind (Kubernetes in Docker), which lets you run multiple nodes as Docker containers on a single machine.

9. The Reconciliation Loop (Deep Theory)

This is the most important thing to understand about how k8s works under the hood.

The Controller Pattern

Every controller in k8s follows the same pattern:

1. Watch the API server for relevant events
2. Compute the diff between desired state and actual state
3. Take action to reduce the diff
4. Repeat

The ReplicaSet controller watches for:

  • ReplicaSet objects (desired state)
  • Pod objects with matching labels (actual state)

On every relevant change, it computes: do I have the right number of pods? If actual is less than desired, create pods. If actual is more than desired, delete pods.

This loop runs continuously. It doesn’t run once and stop. Even if nothing changes, controllers re-check their state periodically (resync) to catch any drift.
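As an illustration, here is that loop reduced to a few lines of Python. This is a toy model of the ReplicaSet controller's decision step, not real client-go code; the `reconcile` function and its data shapes are invented for the example.

```python
# Toy model of a level-triggered ReplicaSet-style controller.
# `desired_replicas` and `pods` stand in for API server state.

def reconcile(desired_replicas, pods):
    """One pass of the loop: diff actual against desired, return actions."""
    actions = []
    if len(pods) < desired_replicas:
        # Too few pods: create enough to close the gap.
        for i in range(desired_replicas - len(pods)):
            actions.append(("create", f"pod-{len(pods) + i}"))
    elif len(pods) > desired_replicas:
        # Too many pods: delete the surplus.
        for pod in pods[desired_replicas:]:
            actions.append(("delete", pod))
    return actions

# Drifted state: one pod vanished and the deletion event may have been missed.
print(reconcile(3, ["pod-0", "pod-1"]))           # a single create closes the gap
print(reconcile(3, ["pod-0", "pod-1", "pod-2"]))  # converged: no actions
```

Note that the function never asks what event happened; it only compares counts. That property is what makes the loop safe to rerun at any time.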

Level-Triggered vs Edge-Triggered

This is a concept borrowed from electronics that’s directly relevant to how k8s controllers work.

Edge-triggered: React to changes. “A pod was deleted” triggers the reconciliation.

Level-triggered: React to state. “The current state is: 2 pods running” compared against “desired state: 3 pods running” triggers reconciliation.

Kubernetes controllers are level-triggered. This is a deliberate design choice.

Edge-triggered systems can miss events. If you’re only listening for “pod deleted” events and you miss one (network partition, process restart), you’ll have the wrong count and never correct it.

Level-triggered systems don’t have this problem. Even if you missed the deletion event, the next time the controller looks at actual state (2 pods) vs desired state (3 pods), it will create a pod. The system converges to the correct state regardless of which events were or weren’t processed.

This is why k8s controllers also do periodic resyncs: they re-examine the full state at regular intervals, not just on events.

Why Kubernetes is Eventually Consistent

k8s does not guarantee that your desired state is immediately achieved. It guarantees that it will eventually be achieved (assuming the system has the capacity to achieve it).

If you create a Deployment with 10 replicas, the pods don’t all start simultaneously. The scheduler picks nodes for them, the kubelets pull images, containers start, readiness probes pass, the service endpoints are updated… this all takes time. For a few seconds or minutes, the actual state is different from the desired state.

This is fine. The system is continuously converging. You need to design your applications to tolerate this. Use readiness probes so traffic doesn’t hit unready pods. Use graceful shutdown so in-flight requests complete. Don’t assume a pod is immediately available after a deployment.

What Happens When You kubectl apply

Let’s trace the full path of kubectl apply -f deployment.yaml:

1.  kubectl reads kubeconfig, constructs HTTP request
2.  kubectl sends POST/PATCH to kube-apiserver
3.  API server authenticates the request (token, client cert)
4.  API server authorizes the request (RBAC: can this user create Deployments?)
5.  Admission controllers run:
    - Mutating webhooks (may modify the object, add defaults)
    - Validating webhooks (may reject the object)
    - Built-in admission: LimitRanger (apply default limits), etc.
6.  API server validates the object schema
7.  API server writes the Deployment to etcd
8.  API server returns a success response to kubectl (201 Created for a new object, 200 OK for an update)

(Asynchronously, in parallel from here)

9.  Deployment controller (in controller-manager) gets notified via watch
10. Deployment controller creates a ReplicaSet
11. ReplicaSet controller gets notified via watch
12. ReplicaSet controller creates Pod objects (no node assigned yet)
13. Scheduler gets notified via watch (new pending pods)
14. Scheduler runs filter + score for each pod
15. Scheduler posts a Binding via the API server, which records the node name in the pod spec (persisted to etcd)
16. kubelet on chosen node gets notified via watch
17. kubelet calls containerd to pull image (if not cached)
18. kubelet creates containers
19. kubelet runs startup probes
20. kubelet runs readiness probes
21. Endpoints controller updates Service Endpoints when pods become ready
22. kube-proxy sees new Endpoints, updates iptables rules
23. Traffic can now reach the new pods

This whole process takes seconds to a minute depending on image pull times. Steps 9-23 are all asynchronous and concurrent.
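The asynchronous fan-out in steps 9-23 can be modeled as a chain of watches. The sketch below is a toy in-memory version (a dict standing in for etcd, callbacks standing in for controller watches); all names and shapes are invented for illustration.

```python
# Toy watch chain: a dict stands in for etcd, callbacks for controllers.

store = {}     # kind -> list of stored objects
watchers = {}  # kind -> list of callbacks notified on create

def create(kind, obj):
    store.setdefault(kind, []).append(obj)
    for callback in watchers.get(kind, []):
        callback(obj)

def watch(kind, callback):
    watchers.setdefault(kind, []).append(callback)

# Deployment controller: a new Deployment yields a ReplicaSet.
watch("Deployment",
      lambda d: create("ReplicaSet", {"owner": d["name"], "replicas": d["replicas"]}))

# ReplicaSet controller: a new ReplicaSet yields unscheduled Pods.
watch("ReplicaSet",
      lambda rs: [create("Pod", {"owner": rs["owner"], "node": None})
                  for _ in range(rs["replicas"])])

# Scheduler: a new Pod with no node gets one assigned.
watch("Pod", lambda pod: pod.update(node="node-1") if pod["node"] is None else None)

create("Deployment", {"name": "web-app", "replicas": 3})
print(store["Pod"])  # three pods, each bound to node-1
```

One `create` call cascades through three controllers, just as one `kubectl apply` does in a real cluster; nobody orchestrates the sequence top-down.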

10. Resource Management

Requests vs Limits

Every container in a pod can specify resource requests and limits for CPU and memory.

Requests are what the container is guaranteed to get. The scheduler uses requests to decide which node to place a pod on. If you request 500m CPU, the scheduler only places you on a node that has at least 500m unallocated CPU.

Limits are the maximum the container is allowed to use. If a container tries to use more CPU than its limit, it is throttled (slowed down). If it tries to use more memory than its limit, it is OOMKilled (the kernel terminates the process).

CPU is specified in millicores: 1000m = 1 CPU core, 250m = 0.25 cores. Memory is in bytes with binary (power-of-two) suffixes: 256Mi = 256 mebibytes, 1Gi = 1 gibibyte.

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "512Mi"

Setting requests and limits is not optional in production. Without them, pods can consume all resources on a node, starving other pods, and the scheduler can’t make good placement decisions.
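These units are easy to mishandle in tooling. A small sketch of parsing them into comparable base units (covering only the suffixes used in this post, not the full Kubernetes quantity grammar):

```python
# Sketch: convert resource quantity strings into comparable base units.
# Covers only the common suffixes, not the full Kubernetes quantity grammar.

def parse_cpu(quantity):
    """'250m' -> 0.25 cores; '1' -> 1.0 cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory(quantity):
    """'256Mi' -> bytes, using binary (1024-based) suffixes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)  # plain bytes

print(parse_cpu("250m"))      # 0.25
print(parse_memory("256Mi"))  # 268435456
```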

QoS Classes

Based on how you set requests and limits, Kubernetes assigns each pod one of three Quality of Service classes. This affects which pods get killed first when a node runs out of memory.

Guaranteed: All containers have equal requests and limits for both CPU and memory. These pods are the last to be evicted. Set this for critical services.

Burstable: The pod doesn’t meet the Guaranteed criteria, but at least one container has a CPU or memory request or limit set. These pods are evicted after BestEffort pods when memory pressure occurs.

BestEffort: No containers have any requests or limits set. These pods are the first to be evicted. Don’t use this in production.

The QoS class is not set by you directly. It’s computed from the requests and limits you set. This is one reason to always set both requests and limits.
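The classification is roughly the following (a simplified sketch; the real kubelet logic also handles defaulting and per-resource edge cases):

```python
# Simplified sketch of QoS classification from requests and limits.
# Real kubelet logic also handles defaulting and per-resource edge cases.

def qos_class(containers):
    """containers: list of {'requests': {...}, 'limits': {...}} dicts."""
    if all(not c["requests"] and not c["limits"] for c in containers):
        return "BestEffort"
    if all(c["requests"] == c["limits"]
           and set(c["requests"]) == {"cpu", "memory"} for c in containers):
        return "Guaranteed"
    return "Burstable"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "100m"}, "limits": {}}]))     # Burstable
print(qos_class([{"requests": {}, "limits": {}}]))                  # BestEffort
```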

LimitRange and ResourceQuota

LimitRange sets default requests and limits for containers in a namespace. If a container is deployed without specifying resources, LimitRange fills in the defaults. It also enforces min/max bounds.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: staging
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "2"
      memory: "2Gi"

ResourceQuota limits the total resources consumed by all objects in a namespace. It can limit the total number of pods, total CPU requests, total memory limits, number of PVCs, etc.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.cpu: "8"
    limits.memory: "16Gi"
    pods: "20"
    persistentvolumeclaims: "10"
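Conceptually, the quota admission check is a sum-and-compare. A simplified sketch, with quantities pre-parsed into base units (the function name and shapes are invented for illustration):

```python
# Simplified quota admission check: would this pod's requests fit?
# Quantities are pre-parsed into base units (cores, bytes, counts).

def fits_quota(hard, used, pod_request):
    """Return (True, None) if the pod fits, else (False, blocking resource)."""
    for resource, ceiling in hard.items():
        if used.get(resource, 0) + pod_request.get(resource, 0) > ceiling:
            return False, resource
    return True, None

hard = {"requests.cpu": 4.0, "requests.memory": 8 * 1024**3, "pods": 20}
used = {"requests.cpu": 3.8, "requests.memory": 2 * 1024**3, "pods": 12}
pod  = {"requests.cpu": 0.5, "requests.memory": 1 * 1024**3, "pods": 1}

print(fits_quota(hard, used, pod))  # rejected: requests.cpu would exceed 4
```

This is why quota errors name a specific resource: admission stops at the first dimension the pod would push over its ceiling.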

Why OOMKilled Happens and How to Prevent It

OOMKilled happens when a container’s memory usage exceeds its limit. The Linux kernel’s OOM killer terminates the process. In k8s, this shows up as OOMKilled in kubectl describe pod and the container restarts.

Common causes:

  • Memory limit set too low for the workload.
  • Memory leak in the application.
  • Unexpected traffic spike causing more in-memory data.
  • JVM (Java) heap size configured to a value higher than the container’s memory limit. Older JVMs (before JDK 10 / 8u191) don’t read cgroup memory limits by default, so they size the heap from the host’s total memory.

Prevention:

  • Profile your application’s memory usage under realistic load before setting limits.
  • Set limits with some headroom above the normal working set.
  • For JVM apps, set -XX:MaxRAMPercentage=75.0 to cap the heap at 75% of the container’s memory, leaving headroom for metaspace, thread stacks, and direct buffers.
  • Monitor memory usage over time and adjust limits based on actual data.
  • Implement memory profiling and alerting before limits are hit.
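As a back-of-envelope check for the JVM case: with a 512Mi limit and MaxRAMPercentage=75, the heap is capped well below the container limit, which is exactly the headroom you want:

```python
# Back-of-envelope: heap ceiling under -XX:MaxRAMPercentage=75.0
# for a container with a 512Mi memory limit.

limit_mib = 512
max_ram_percentage = 75.0

heap_cap_mib = limit_mib * max_ram_percentage / 100
print(heap_cap_mib)              # 384.0 MiB heap ceiling
print(limit_mib - heap_cap_mib)  # 128.0 MiB left for metaspace, stacks, buffers
```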

11. Health Checks

The Three Probes

Kubernetes supports three types of probes for understanding a container’s health state.

Liveness Probe: Answers “is this container alive?” If it fails, the kubelet kills the container and restarts it (subject to the restart policy). Use this to detect deadlocks and infinite loops that don’t crash the process but do make it useless.

Readiness Probe: Answers “is this container ready to serve traffic?” If it fails, the pod is removed from the Service’s endpoint list. Traffic stops being sent to it. The pod is not killed. Use this to prevent traffic from going to pods that are warming up, overloaded, or temporarily unable to serve requests.

Startup Probe: A special case for slow-starting containers. While the startup probe is running, liveness and readiness probes are disabled. If the startup probe succeeds, normal liveness/readiness probes kick in. If it fails after the configured threshold, the container is killed. Use this for legacy applications that take minutes to initialize.

Probe Types

httpGet: Makes an HTTP GET request. Success is any 2xx or 3xx status code.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: X-Health-Check
      value: liveness
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

tcpSocket: Tries to open a TCP connection. Success means the port is accepting connections.

livenessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 15
  periodSeconds: 10

exec: Runs a command inside the container. Success is exit code 0.

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - "redis-cli ping | grep PONG"
  initialDelaySeconds: 5
  periodSeconds: 10

Probe Configuration Parameters

  • initialDelaySeconds: Wait this long after container start before the first probe. Prevents probes from running before the app has had a chance to start.
  • periodSeconds: How often to run the probe.
  • timeoutSeconds: How long to wait for a response before counting as failure.
  • failureThreshold: How many consecutive failures before the probe is considered failed.
  • successThreshold: How many consecutive successes (after failure) before the probe is considered successful again. Must be 1 for liveness and startup probes.

Getting these values right matters. Too aggressive (short timeouts, low failure threshold) and you’re restarting healthy pods under momentary load. Too loose (long periods, high failure threshold) and you’re sending traffic to dead pods for too long.
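A useful rule of thumb: the time to detect a hung container is governed by periodSeconds, timeoutSeconds, and failureThreshold together. A rough estimate (ignoring jitter and where in the probe cycle the hang lands):

```python
# Rough estimate of how long a liveness probe takes to declare a hung
# container dead: failing probes run to timeout, spaced periodSeconds apart.
# Ignores jitter and where in the probe cycle the hang lands.

def detection_window(period_seconds, timeout_seconds, failure_threshold):
    return (failure_threshold - 1) * period_seconds + timeout_seconds

# Values from the httpGet liveness example above.
print(detection_window(period_seconds=15, timeout_seconds=5, failure_threshold=3))
# roughly 35 seconds before the kubelet restarts the container
```

Running this calculation before you deploy is a cheap way to sanity-check whether your probe settings match how long an outage you can tolerate.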

12. Rolling Updates and Rollbacks

How Rolling Updates Work

When you update a Deployment (change the image, for example), the Deployment controller doesn’t kill all old pods and start all new ones simultaneously. That would cause downtime. It does a rolling update.

The two key parameters are:

maxUnavailable: The maximum number of pods that can be unavailable during the update. Can be an absolute number or a percentage of desired replicas. Default is 25%.

maxSurge: The maximum number of pods that can exist above the desired count during the update. Can be absolute or percentage. Default is 25%.

With 10 replicas, maxUnavailable=2, maxSurge=2, the rolling update goes roughly like:

Initial: 10 old pods running, 10/10 available

Step 1: Create 2 new pods (total: 12, surge=2)
Step 2: When 2 new pods are ready, terminate 2 old pods (total: 10, 8 old + 2 new)
Step 3: Create 2 more new pods (total: 12)
Step 4: When those are ready, terminate 2 more old pods
...
Final: 10 new pods running, 10/10 available

The critical point: new pods must pass their readiness probe before old pods are terminated. This is why readiness probes are essential for zero-downtime deployments.
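The sequence above can be stepped through in code. This simplified model uses only the surge headroom (as the walkthrough does) and assumes each new batch passes readiness before the matching old pods are retired:

```python
# Step-through of the walkthrough above: 10 replicas, maxSurge=2.
# Simplification: only the surge headroom is used, and each new batch
# passes readiness before the matching old pods are terminated.

def rolling_update(replicas, max_surge):
    old, new = replicas, 0
    log = []
    while old > 0:
        batch = min(max_surge, old)
        new += batch
        log.append(f"create {batch} new -> total {old + new}")
        old -= batch  # batch passed readiness; retire the same number of old pods
        log.append(f"kill {batch} old -> {old} old + {new} new")
    return log

for line in rolling_update(10, 2):
    print(line)
```

The availability count never drops below the desired replica count in this model, which is the whole point of surging before killing.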

Rollout Commands

# Trigger a rollout by changing the image
kubectl set image deployment/web-app web=myapp:2.2.0

# Watch the rollout progress
kubectl rollout status deployment/web-app

# Pause a rollout (useful for canary-like manual progression)
kubectl rollout pause deployment/web-app

# Resume a paused rollout
kubectl rollout resume deployment/web-app

# View rollout history
kubectl rollout history deployment/web-app

# View details of a specific revision
kubectl rollout history deployment/web-app --revision=3

# Rollback to previous revision
kubectl rollout undo deployment/web-app

# Rollback to a specific revision
kubectl rollout undo deployment/web-app --to-revision=2

Why Readiness Probes are Essential for Zero-Downtime

Without readiness probes, Kubernetes considers a pod ready as soon as its containers are running. The pod gets added to the Service endpoints immediately. But “container is running” does not mean “application is ready to handle requests.” The app might still be:

  • Running database migrations
  • Warming up caches
  • Establishing connection pools
  • Loading ML models into memory

If the rolling update terminates old pods and adds unready new pods, requests fail during that window.

With readiness probes, the new pod is only added to the Service endpoints after the probe passes. The old pod stays in the endpoints until the new pod is confirmed ready. This creates the overlap that ensures zero-downtime.

The readiness probe also acts as a circuit breaker during the rollout. If your new version is broken and the readiness probe never passes, the rolling update stops. The maxUnavailable old pods are still running. You haven’t taken down your whole service with a bad deploy. The Deployment is in a partially-updated state, and you can rollback.

Putting It All Together: A Production-Ready Example

Here’s a complete example of a production-grade application deployment with all the concepts above:

apiVersion: v1
kind: ConfigMap
metadata:
  name: web-app-config
  namespace: production
data:
  APP_ENV: "production"
  LOG_LEVEL: "warn"
  DB_HOST: "postgres.production.svc.cluster.local"
  DB_PORT: "5432"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-app-uploads
  namespace: production
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
      tier: frontend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web-app
        tier: frontend
        version: "2.1.0"
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: web
        image: myregistry/web-app:2.1.0
        ports:
        - containerPort: 8080
          protocol: TCP
        envFrom:
        - configMapRef:
            name: web-app-config
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "1000m"
            memory: "512Mi"
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        volumeMounts:
        - name: uploads
          mountPath: /app/uploads
      volumes:
      - name: uploads
        persistentVolumeClaim:
          claimName: web-app-uploads
---
apiVersion: v1
kind: Service
metadata:
  name: web-app-svc
  namespace: production
spec:
  selector:
    app: web-app
    tier: frontend
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: ClusterIP

This example has:

  • A ConfigMap injected as environment variables
  • A PVC for persistent file storage
  • A Deployment with a rolling update strategy
  • CPU and memory requests and limits (Burstable QoS)
  • A startup probe for initial boot (gives 150 seconds: 30 failures * 5s period)
  • A liveness probe that restarts stuck containers
  • A readiness probe that gates traffic
  • A ClusterIP Service tied to the Deployment via labels

This is the shape of most production workloads. The complexity comes from tuning the probe timings and resource values for your specific application’s behavior.

13. Helm: The Package Manager for Kubernetes

If kubectl lets you apply individual YAML files, Helm lets you install entire applications with a single command. Think of it as apt or brew, but for Kubernetes.

Why Helm Exists

A real application on Kubernetes is not one YAML file. It is a Deployment, a Service, a ConfigMap, a Secret, an Ingress, maybe a PersistentVolumeClaim, maybe a ServiceAccount and RBAC rules. That is 6-8 YAML files, all of which need to be versioned, templated, and deployed together.

Helm bundles all of this into a chart. A chart is a directory of templates plus a values.yaml file that controls the configuration. You install a chart, Helm renders the templates with your values, and applies everything to the cluster as a single tracked release (add --atomic to roll back automatically if the install fails).

Installation

# macOS
brew install helm

# Linux
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Verify
helm version

Core Concepts

  • Chart: a package of Kubernetes YAML templates
  • Release: an installed instance of a chart in your cluster
  • Repository: a collection of charts (like a package registry)
  • values.yaml: default configuration for a chart, overridable at install time

Essential Commands

# Add a chart repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add stable https://charts.helm.sh/stable

# Update repo index
helm repo update

# Search for a chart
helm search repo bitnami/kafka
helm search hub nginx              # search Artifact Hub (public registry)

# Install a chart (creates a release named "my-kafka")
helm install my-kafka bitnami/kafka

# Install with custom values
helm install my-kafka bitnami/kafka \
  --set replicaCount=3 \
  --set persistence.size=20Gi

# Install with a values file
helm install my-kafka bitnami/kafka -f my-values.yaml

# Install in a specific namespace (creates namespace if it doesn't exist)
helm install my-kafka bitnami/kafka \
  --namespace kafka \
  --create-namespace

# List all releases
helm list
helm list --all-namespaces

# Show the status of a release
helm status my-kafka

# Show the values a release was installed with
helm get values my-kafka

# Show all rendered YAML for a release (without installing)
helm get manifest my-kafka

# Upgrade a release (apply new values or new chart version)
helm upgrade my-kafka bitnami/kafka -f my-values.yaml

# Install or upgrade in one command
helm upgrade --install my-kafka bitnami/kafka -f my-values.yaml

# Rollback a release to the previous version
helm rollback my-kafka

# Rollback to a specific revision
helm rollback my-kafka 2

# Show release history
helm history my-kafka

# Uninstall a release (deletes all k8s resources it created)
helm uninstall my-kafka

Dry Run and Debugging

# Render templates without installing (validate YAML before applying)
helm install my-app ./mychart --dry-run

# Print rendered templates to stdout
helm template my-app ./mychart

# Lint a chart for errors
helm lint ./mychart

# Debug template rendering
helm template my-app ./mychart --debug

Creating Your Own Chart

# Scaffold a new chart
helm create mychart

This creates:

mychart/
  Chart.yaml          # chart metadata (name, version, description)
  values.yaml         # default values
  templates/          # YAML templates
    deployment.yaml
    service.yaml
    ingress.yaml
    _helpers.tpl      # reusable template fragments
    NOTES.txt         # printed after install
  charts/             # chart dependencies

values.yaml and Templates

# values.yaml
replicaCount: 2
image:
  repository: nginx
  tag: "1.25"
  pullPolicy: IfNotPresent
service:
  type: ClusterIP
  port: 80
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mychart.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}

The {{ .Values.X }} syntax is Go templating. Helm renders the template by substituting values at install time.

Overriding Values

# Override at install time
helm install myapp ./mychart --set replicaCount=5

# Use an environment-specific values file
helm upgrade --install myapp ./mychart \
  -f values.yaml \
  -f values.production.yaml    # production overrides (applied last, wins)

Chart Dependencies

# Chart.yaml
dependencies:
  - name: postgresql
    version: "15.x.x"
    repository: https://charts.bitnami.com/bitnami
  - name: redis
    version: "19.x.x"
    repository: https://charts.bitnami.com/bitnami
# Download dependencies into charts/
helm dependency update

Helm vs kubectl apply

              kubectl apply                Helm
Unit          Single YAML file             Chart (many files)
Versioning    Git history                  Chart versions + release history
Rollback      Manual (re-apply old YAML)   helm rollback
Templating    None                         Go templates + values
Sharing       Copy YAML files              Push chart to registry
Dependencies  Manual                       Chart.yaml dependencies

Use kubectl apply for simple one-off resources. Use Helm when you are deploying an application with multiple components, or when you need to parameterize config across environments (dev, staging, production).

Final Thoughts

Kubernetes is a lot to absorb at once. The thing that helped me most was understanding that every component in k8s is just implementing the reconciliation loop in some domain.

The scheduler reconciles “pods without nodes” to “pods with nodes.” The kubelet reconciles “pods assigned to my node” to “containers actually running.” The Deployment controller reconciles “desired template” to “actual ReplicaSets.” kube-proxy reconciles “Service and Endpoint objects” to “iptables rules.”

Once you have that mental model, the entire system starts to feel coherent. You’re not memorizing a bunch of unrelated components. You’re seeing one pattern applied across dozens of domains.

The rest is details. Good details, worth knowing, but details.