KubeCon India 2026 (Mumbai) — Day 1 Deep Dives

07 · Root Without Risk — Kubernetes User Namespaces (KEP-127)

Deep dive 7 of 17 · Security, policy & identity

Jun 18, 2026 · conferences · 23 min read · 5100 words · advanced

Root without risk — how user namespaces finally made container root safe.

Deep dive 7 of the KubeCon Mumbai 2026 series. Sumir Broota (Tech Architect, Kubestronaut) told the nine-year story of KEP-127 — Kubernetes user namespaces — which reached GA in v1.36. The whole feature reduces to one pod field: hostUsers: false. With it, root inside the container is no longer root on the host — UID 0 in the pod maps to an unprivileged high UID outside it. This post unpacks why that single flag took nine years and eleven releases to ship, the idmap-mount breakthrough that made it practical, and the full path from PodSpec down to the mount_setattr(MOUNT_ATTR_IDMAP) syscall.

This is the most kernel-level talk in the security cluster, and a perfect counterpart to the Kyverno deep dive: Kyverno governs what is allowed to run; user namespaces change what damage it can do if it escapes. It's defense in depth at the deepest layer.

The problem — root in the container is root on the host

By default, Linux containers share the host's user namespace, so a process running as uid=0(root) inside the container is the same uid=0(root) on the host. Add a powerful capability like CAP_SYS_ADMIN, and a container escape is more than a containment failure: it hands the attacker full root on the host.

[Figure: default (shared user namespace), same root on both sides. Inside the container: uid=0(root) with CAP_SYS_ADMIN, CAP_NET_ADMIN…; on the host: uid=0(root). Escape ⇒ full root on host.]

Fig 1 — without user namespaces, container UID 0 and host UID 0 are the same identity.

One kernel bug = full host compromise. The talk listed the receipts: CVE-2019-5736 (runc host-binary overwrite), CVE-2024-21626 (runc working-directory leak), CVE-2021-25741 (symlink path traversal), and Azurescape. In every one, the damage was catastrophic because container root equalled host root. User namespaces don't fix the bugs — they remove the catastrophic blast radius when one is exploited.

CAP_SYS_ADMIN — a loaded gun, defused

The clearest demonstration was a side-by-side. The same pod, granted CAP_SYS_ADMIN, behaves completely differently depending on one field:

hostUsers: true (default) | hostUsers: false (KEP-127)
--- | ---
mount -o bind /host /mnt → succeeds | mount -o bind /host /mnt → permission denied
cat /mnt/etc/shadow → host hashes leak | id → uid=0 only inside the namespace
✗ host filesystem compromised | ✓ capability scoped to the pod's userns only

The capability is still present — but it's now scoped to the pod's own user namespace, where it can do no damage to the host. That's the essence of "root without risk": you keep the in-container privileges workloads sometimes need, while neutering their reach.

Nine years from idea to GA

The headline number: 9 years, 11 Kubernetes releases, 3 container runtimes modified, 7+ CVEs mitigated. The timeline of KEP-127:

When | Milestone
--- | ---
2016 | Idea forms: multi-tenancy & breakout mitigation discussed.
2018 | Initial node-level implementation tested (PR closed in 2019).
2021 | idmap support lands in ext4/xfs/fat (Linux kernel 5.12).
v1.25 (2022) | Alpha: stateless pods (PR #111090, by @rata).
v1.27 (2023) | Volume-mount redesign: idmap replaces chown.
v1.28 (2023) | Stateful-pod support; UserNamespacesSupport feature gate.
v1.30 (2024) | Beta (off by default); configurable UID ranges (PR #123593).
v1.33 (2025) | Beta on by default; Pod Security Standards integration.
v1.36 (2026) | GA: generally available, with Prometheus metrics.

Why nine years? Four hard problems

The talk's most honest section explained that the delay wasn't bureaucracy — it was four genuinely hard problems, solved one at a time:

  1. Kernel support. idmap mounts need Linux 5.12+; tmpfs idmap support only arrived in 6.3+. Many distros and filesystems took years to catch up.
  2. Alpha rewrites. The v1.25 approach chown'd volume mounts — far too slow. v1.27 redesigned it around idmap mounts; v1.28 added stateful-pod support.
  3. CRI protocol overhaul. A new UserNamespace proto message, IDMapping added to Mount — and every runtime (containerd, CRI-O, runc/crun) had to follow.
  4. Storage & UX. The naive approach duplicated and chowned images per pod; volume compatibility (NFS, devices) and Pod Security Standards integration all needed solving.

The two changes that flipped the curve

idmap mounts — −97% overhead (v1.27)

The original method recursively chown'd the entire root filesystem of every pod so the files matched the pod's UID range, which duplicated image storage per pod and added tens of seconds to startup. idmap mounts replaced that with a virtual remapping: startup time dropped from roughly 30s (chown, v1.25) to ~1s (idmap, v1.27), and the per-pod image storage overhead fell by about 97%. It needs Linux 5.12+ for files and 6.3+ for overlayfs (single idmap per overlayfs, PR #12092).

Why idmap is the unlock. Instead of physically rewriting ownership on disk for every pod, idmap tells the kernel to translate ownership on the fly during filesystem lookups. One shared image, many pods, each seeing the files as owned by its own root — with zero duplication. Without this, per-pod user namespaces would have been too slow and too storage-hungry to use in production. Performance, not security, was the long pole.
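To make the difference concrete, here is a toy model in Python, not kernel or runtime code. The image contents and the three UID range bases are made up for illustration. It only contrasts where the work happens: the chown approach stores one rewritten copy per pod, while the idmap approach keeps one shared copy and translates ownership at lookup time.

```python
# Toy model only: path -> on-disk owner UID for a made-up image.
IMAGE = {"/bin/sh": 0, "/etc/passwd": 0, "/var/www": 33}

def chown_rootfs(image, host_base):
    """chown-style: copy the image for one pod and rewrite every owner on disk."""
    return {path: host_base + uid for path, uid in image.items()}

def idmap_owner(image, path, host_base):
    """idmap-style: leave the shared image untouched, translate the owner at lookup."""
    return host_base + image[path]

# Three pods, each with its own (hypothetical) 65536-ID range.
pod_bases = [65536 * n for n in range(1, 4)]

copies = [chown_rootfs(IMAGE, base) for base in pod_bases]
stored_with_chown = sum(len(copy) for copy in copies)  # one rewritten copy per pod
stored_with_idmap = len(IMAGE)                         # one shared copy

# Both approaches yield the same host-side owner for each pod...
assert all(
    copies[i]["/var/www"] == idmap_owner(IMAGE, "/var/www", base)
    for i, base in enumerate(pod_bases)
)
# ...but only chown multiplies the stored files by the number of pods.
print(stored_with_chown, stored_with_idmap)  # 9 3
```

The stored-file count grows with the number of pods in the first case and stays constant in the second, which is the "zero duplication" property described above.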

hostUsers: false — a one-line pod field (v1.30 → v1.36)

The other change was making it usable. By Beta the entire feature collapsed to a single pod-spec field: hostUsers: false. v1.30 shipped it as Beta, off by default; v1.33 turned the feature on by default; v1.34 added Prometheus metrics; and v1.36 made it GA. Configurable UID/GID ranges arrived in PR #123593.
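A minimal sketch of a pod that opts in, built as a Python dict and printed as JSON (a format kubectl accepts for manifests). The pod name, image, command, and the added SYS_ADMIN capability are illustrative assumptions rather than anything shown in the talk; hostUsers is the only field the feature itself needs.

```python
import json

def userns_pod(name, image):
    """Build a Pod manifest that opts in to a per-pod user namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # The one field that matters: false means "do not share the
            # host's user namespace".
            "hostUsers": False,
            "containers": [
                {
                    "name": "app",
                    "image": image,
                    "command": ["sleep", "infinity"],
                    # Illustrative: the capability stays available in the pod,
                    # but is scoped to the pod's own user namespace.
                    "securityContext": {"capabilities": {"add": ["SYS_ADMIN"]}},
                }
            ],
        },
    }

# Hypothetical name and image, chosen only for the example.
manifest = userns_pod("userns-demo", "debian:stable")
print(json.dumps(manifest, indent=2))
```

Piping the output into kubectl apply -f - should create the pod on a cluster where the feature is available; running id inside it would be expected to report uid=0 even though the process is unprivileged on the host.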

How it works — from PodSpec to UID 65536

The end-to-end path the talk traced, layer by layer:

[Figure: PodSpec → kernel syscall. kubectl (hostUsers: false) → apiserver (validate field) → kubelet (UsernsManager) → CRI (UserNamespace) → containerd (idmap snapshot) → runc + kernel (mount_setattr).]

Fig 2 — the full chain: a pod field becomes a kernel idmap via the CRI and the runtime.

  • kubelet — UsernsManager. A per-node allocator (pkg/kubelet/userns/userns_manager.go). When a pod opts in (hostUsers: false; nil/true stays in shared NODE mode), the manager allocates a 65536-ID range by picking the next free slot in an allocation bitmap, persists it under /var/lib/kubelet/pods/$UID/userns, and returns a CRI UserNamespace{POD, uids, gids}.
  • CRI protocol. NamespaceOption gained a UserNamespace userns_options field; UserNamespace carries a mode (NODE|POD) and repeated IDMapping{host_id, container_id, length}. Mount gained uid_mappings/gid_mappings for idmap volumes.
  • containerd. From chown to idmap across PRs #7679 (initial userns CRI), #10387 (multi-entry mappings), #12092 (single idmap per overlayfs). Falls back to slow_chown if the kernel is < 6.3. Minimum: containerd ≥ 2.0, runc ≥ 1.2 / crun ≥ 1.9, kernel ≥ 6.3.
  • The syscall. mount_setattr(MOUNT_ATTR_IDMAP): open a userns fd with the desired UID/GID map, call mount_setattr referencing it, and the kernel rewrites ownership on-the-fly during VFS lookups. "No data is rewritten on disk — the mapping is virtual."
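The allocation step in the first two bullets can be sketched as a toy allocator. This is not the kubelet's code: the class name, the slot count, and the choice to start the first range at host ID 65536 are assumptions for illustration, and persistence to disk is left out. The dataclasses mirror only the field names mentioned above.

```python
from dataclasses import dataclass

RANGE_LEN = 65536  # IDs handed to each pod, as described above

@dataclass(frozen=True)
class IDMapping:
    host_id: int
    container_id: int
    length: int

@dataclass(frozen=True)
class UserNamespace:
    mode: str    # "NODE" (shared with host) or "POD" (per-pod)
    uids: tuple  # tuple of IDMapping
    gids: tuple  # tuple of IDMapping

class ToyUsernsAllocator:
    """Toy stand-in for a per-node allocator; hypothetical, not the real manager."""

    def __init__(self, slots, first_host_id=RANGE_LEN):
        self.used = [False] * slots         # allocation bitmap, one entry per range
        self.first_host_id = first_host_id  # assumed: slot 0 sits above host IDs 0-65535
        self.by_pod = {}                    # pod UID -> slot index

    def allocate(self, pod_uid):
        if pod_uid not in self.by_pod:
            slot = self.used.index(False)   # next free slot; ValueError when full
            self.used[slot] = True
            self.by_pod[pod_uid] = slot
        slot = self.by_pod[pod_uid]
        mapping = IDMapping(self.first_host_id + slot * RANGE_LEN, 0, RANGE_LEN)
        return UserNamespace("POD", (mapping,), (mapping,))

    def release(self, pod_uid):
        self.used[self.by_pod.pop(pod_uid)] = False

allocator = ToyUsernsAllocator(slots=4)
print(allocator.allocate("pod-a").uids[0])  # host_id=65536
print(allocator.allocate("pod-b").uids[0])  # host_id=131072
```

Because each slot is a fixed-size, non-overlapping range, two pods on the same node never share a host UID, so a process that escapes one pod has no ownership over another pod's files.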

The mapping that makes root harmless

# UID mapping (length = 65536)
container UID 0   (root inside the pod)   →   host UID 65536  (unprivileged)
container UID 65535                        →   host UID 131071
# host UIDs 0–65535 (real system) are unreachable from the pod

This is the whole trick in three lines. Inside the pod, processes really are UID 0 with their capabilities, so applications that demand root still work. But that UID 0 is mapped to host UID 65536, an account with no privileges on the host. Escape the container and you're not root on the box; you're an unprivileged user who can't read /etc/shadow, bind-mount the host, or ptrace host processes.
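The arithmetic behind that mapping is a single offset. A small sketch, assuming the same single mapping as above (container ID 0, host ID 65536, length 65536); the function names are made up for the example.

```python
def to_host_id(container_id, host_base=65536, length=65536):
    """Translate an ID inside the pod to the host ID it maps to."""
    if not 0 <= container_id < length:
        raise ValueError("ID is outside the mapped range")
    return host_base + container_id

def to_container_id(host_id, host_base=65536, length=65536):
    """Reverse lookup; None means the host ID has no mapping inside the pod."""
    if host_base <= host_id < host_base + length:
        return host_id - host_base
    return None

print(to_host_id(0))       # 65536
print(to_host_id(65535))   # 131071
print(to_container_id(0))  # None: host root is not reachable from the pod
```

The reverse lookup is the important half for security: host UID 0 falls outside the mapped range, so nothing inside the pod can act as it.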

What it takes to adopt

For practitioners, the actionable checklist from the talk:

  • Kernel ≥ 6.3 for overlayfs idmap (5.12 minimum for plain files); containerd ≥ 2.0; runc ≥ 1.2 or crun ≥ 1.9.
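  • Set hostUsers: false in the pod spec. On v1.33+ the feature is enabled by default, but each pod still has to opt in with this field; leaving it unset keeps the shared host namespace.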
  • Verify idmap is actually active: mount | grep overlay should show an ovl-idmapped… mount, not a recursive chown fallback.
  • Watch the v1.34+ Prometheus metrics to confirm pods are getting per-pod user namespaces rather than silently falling back.
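Two of these checks are easy to script. The helpers below are a sketch with hypothetical names: one compares a uname -r style release string against a minimum version, the other reads the /proc/<pid>/uid_map format (inside ID, outside ID, length) to see whether root inside the namespace maps to a non-zero ID outside it. The sample strings are illustrative inputs, not output captured from a real node.

```python
import re

def kernel_at_least(release, major, minor):
    """Compare a `uname -r` style string (e.g. '6.8.0-45-generic') to major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)

def parse_uid_map(text):
    """Parse uid_map text into (inside_id, outside_id, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]

def root_is_remapped(text):
    """True when ID 0 inside the namespace maps to a non-zero ID outside it."""
    for inside, outside, length in parse_uid_map(text):
        if inside <= 0 < inside + length:
            return outside - inside != 0
    return False  # ID 0 has no mapping at all

# Illustrative inputs: an unmapped (host-like) namespace and a per-pod one.
host_map = "         0          0 4294967295\n"
pod_map = "         0      65536      65536\n"

print(root_is_remapped(host_map))                  # False
print(root_is_remapped(pod_map))                   # True
print(kernel_at_least("5.15.0-91-generic", 6, 3))  # False
```

On a node, the inputs would come from platform.release() and from reading /proc/self/uid_map inside the container; a True result from root_is_remapped suggests the pod really did get its own user namespace.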

FAQ

Does hostUsers: false break apps that need to run as root?

No — that's the point. Inside the pod, processes are still UID 0 with capabilities, so root-requiring apps work. The UID is just mapped to an unprivileged host UID, so the privilege is real inside the namespace and harmless outside it.

Why did idmap mounts matter so much?

The original chown approach duplicated and re-owned image files per pod — ~30s startup and big storage overhead. idmap remaps ownership virtually in the kernel (no on-disk rewrite), cutting startup to ~1s and storage overhead by ~97%. It turned user namespaces from "too expensive" into "default-on."

Is this the same as Pod Security Standards / running as non-root?

Complementary. runAsNonRoot avoids UID 0 in the container; user namespaces let you safely keep UID 0 by remapping it. PSS integration (v1.33) means the two work together — defense in depth, not either/or.

What if my kernel is too old?

Runtimes fall back to slow_chown (recursive chown) — functional but slow and storage-heavy. Verify with mount | grep overlay; if you don't see ovl-idmapped, you're on the fallback path and should upgrade the kernel to 6.3+.

Takeaways

  • Container root = host root by default — that shared user namespace is why escapes are catastrophic.
  • hostUsers: false remaps UID 0 to an unprivileged host UID, scoping capabilities to the pod's namespace.
  • idmap mounts were the unlock: virtual ownership remapping cut startup time by ~97% and made the feature practical.
  • Nine years, eleven releases — the hard part was kernel support, CRI changes across three runtimes, and storage UX, not the idea.
  • GA in v1.36, on by default since v1.33 — adopt it: kernel 6.3+, containerd 2.0+, verify with mount | grep ovl-idmapped.

Next in the series — Deep dive 08: Commit-Then-Disclose, which moves from preventing damage to responsibly handling the vulnerabilities that slip through.
