Deep dive 7 of the KubeCon Mumbai 2026 series. Sumir Broota (Tech Architect, Kubestronaut) told the nine-year story of KEP-127 — Kubernetes user namespaces — which reached GA in v1.36. The whole feature reduces to one pod field: hostUsers: false. With it, root inside the container is no longer root on the host — UID 0 in the pod maps to an unprivileged high UID outside it. This post unpacks why that single flag took nine years and eleven releases to ship, the idmap-mount breakthrough that made it practical, and the full path from PodSpec down to the mount_setattr(MOUNT_ATTR_IDMAP) syscall.
This is the most kernel-level talk in the security cluster, and a perfect counterpart to the Kyverno deep dive: Kyverno governs what is allowed to run; user namespaces change what damage it can do if it escapes. It's defense in depth at the deepest layer.
The problem — root in the container is root on the host
By default, Linux containers share the host's user namespace. So a process running as uid=0(root) inside the container is the same uid=0(root) on the host. Add a powerful capability like CAP_SYS_ADMIN, and a container escape is not just a containment failure; it is full root on the host.
Fig 1 — without user namespaces, container UID 0 and host UID 0 are the same identity.
CAP_SYS_ADMIN — a loaded gun, defused
The clearest demonstration was a side-by-side. The same pod, granted CAP_SYS_ADMIN, behaves completely differently depending on one field:
| hostUsers: true (default) | hostUsers: false (KEP-127) |
|---|---|
| mount -o bind /host /mnt → succeeds | mount -o bind /host /mnt → permission denied |
| cat /mnt/etc/shadow → host hashes leak | id → uid=0 only inside the namespace |
| ✗ host filesystem compromised | ✓ capability scoped to the pod's userns only |
The capability is still present — but it's now scoped to the pod's own user namespace, where it can do no damage to the host. That's the essence of "root without risk": you keep the in-container privileges workloads sometimes need, while neutering their reach.
Nine years from idea to GA
The headline number: 9 years, 11 Kubernetes releases, 3 container runtimes modified, 7+ CVEs mitigated. The timeline of KEP-127:
| When | Milestone |
|---|---|
| 2016 | Idea forms — multi-tenancy & breakout mitigation discussed. |
| 2018 | Initial node-level implementation tested (PR closed in 2019). |
| 2021 | idmap support lands in ext4/xfs/fat (Linux kernel 5.12). |
| v1.25 (2022) | Alpha — stateless pods (PR #111090, by @rata). |
| v1.27 (2023) | Volume-mount redesign — idmap replaces chown. |
| v1.28 (2023) | Stateful-pod support; UserNamespacesSupport feature gate. |
| v1.30 (2024) | Beta (off by default); configurable UID ranges (PR #123593). |
| v1.33 (2025) | Beta on by default; Pod Security Standards integration. |
| v1.36 (2026) | GA — generally available, with Prometheus metrics. |
Why nine years? Four hard problems
The talk's most honest section explained that the delay wasn't bureaucracy — it was four genuinely hard problems, solved one at a time:
- Kernel support. idmap mounts need Linux 5.12+; tmpfs idmap support only arrived in 6.3+. Many distros and filesystems took years to catch up.
- Alpha rewrites. The v1.25 approach chown'd volume mounts, which was far too slow. v1.27 redesigned it around idmap mounts; v1.28 added stateful-pod support.
- CRI protocol overhaul. A new UserNamespace proto message, IDMapping added to Mount, and every runtime (containerd, CRI-O, runc/crun) had to follow.
- Storage & UX. The naive approach duplicated and chowned images per pod; volume compatibility (NFS, devices) and Pod Security Standards integration all needed solving.
The two changes that flipped the curve
idmap mounts — −97% overhead (v1.27)
The original method recursively chown'd the entire root filesystem of every pod so the files matched the pod's UID range, duplicating image storage per pod and adding tens of seconds to startup. idmap mounts replaced that with a virtual remapping: startup time dropped from roughly 30s (chown, v1.25) to ~1s (idmap, v1.27), and image storage overhead fell by about 97%. It needs Linux 5.12+ for files and 6.3+ for overlayfs (single idmap per overlayfs, containerd PR #12092).
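To make the contrast concrete, here is a toy model in Python. It is only a sketch: the dictionary standing in for an image layer, the `chown_copy` function, and the `IdmappedView` class are all hypothetical names, and nothing here reflects how the kernel or a runtime is actually implemented. It shows why the chown approach needs one rewritten copy per pod while the idmap approach can share a single untouched image and translate ownership at lookup time.

```python
from typing import Dict

# Toy image layer: path -> owner UID as stored on disk (hypothetical data).
Image = Dict[str, int]


def chown_copy(image: Image, host_base: int) -> Image:
    """Old approach (toy): build a per-pod copy with every owner rewritten."""
    return {path: host_base + uid for path, uid in image.items()}


class IdmappedView:
    """idmap approach (toy): share one image and translate owners on lookup."""

    def __init__(self, image: Image, host_base: int) -> None:
        self.image = image          # shared, never copied or modified
        self.host_base = host_base  # start of this pod's host UID range

    def owner(self, path: str) -> int:
        # The stored UID is untouched; the shift is applied when it is read.
        return self.host_base + self.image[path]


image = {"/bin/sh": 0, "/etc/passwd": 0, "/var/www": 33}

copy_a = chown_copy(image, 65536)      # pod A pays for a full rewritten copy
view_b = IdmappedView(image, 131072)   # pod B reuses the same image

print(copy_a["/var/www"])        # 65569
print(view_b.owner("/var/www"))  # 131105
print(view_b.image is image)     # True: nothing was duplicated
```

The copy grows with the number of files and pods; the view costs one integer per pod, which is the intuition behind the startup and storage savings described above.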
hostUsers: false — a one-line pod field (v1.30 → v1.36)
The other change was making it usable. By Beta the entire feature collapsed to a single pod-spec field: hostUsers: false. v1.30 shipped it as Beta, off by default; v1.33 enabled the feature by default (pods still opt in through the field); v1.34 added Prometheus metrics; and v1.36 made it GA. Configurable UID/GID ranges arrived in PR #123593.
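As an illustration of how small the opt-in is, the sketch below builds a minimal Pod manifest as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The pod name, container name, image, and command are placeholders chosen for the example, not values from the talk; only the hostUsers field is the point.

```python
import json


def userns_pod_manifest(name: str, image: str) -> dict:
    """Build a minimal Pod manifest that opts in to a per-pod user namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # The one field the feature reduces to: do not share the
            # host's user namespace with this pod.
            "hostUsers": False,
            "containers": [
                # Placeholder container; any image works for the demo.
                {"name": "app", "image": image, "command": ["sleep", "3600"]},
            ],
        },
    }


manifest = userns_pod_manifest("userns-demo", "busybox")
print(json.dumps(manifest, indent=2))
```

Saving the output to a file and running kubectl apply -f on it should create the pod on a cluster whose nodes meet the kernel and runtime requirements listed later in this post.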
How it works — from PodSpec to UID 65536
The end-to-end path the talk traced, layer by layer:
Fig 2 — the full chain: a pod field becomes a kernel idmap via the CRI and the runtime.
- kubelet (UsernsManager). A per-node allocator (pkg/kubelet/userns/userns_manager.go). When a pod opts in (hostUsers: false; nil/true stays in shared NODE mode), the manager allocates a 65536-ID range by picking the next free slot in an allocation bitmap, persists it under /var/lib/kubelet/pods/$UID/userns, and returns a CRI UserNamespace{POD, uids, gids}.
- CRI protocol. NamespaceOption gained a UserNamespace userns_options field; UserNamespace carries a mode (NODE | POD) and repeated IDMapping{host_id, container_id, length}. Mount gained uid_mappings/gid_mappings for idmap volumes.
- containerd. From chown to idmap across PRs #7679 (initial userns CRI), #10387 (multi-entry mappings), #12092 (single idmap per overlayfs). Falls back to slow_chown if the kernel is < 6.3. Minimum: containerd ≥ 2.0, runc ≥ 1.2 / crun ≥ 1.9, kernel ≥ 6.3.
- The syscall. mount_setattr(MOUNT_ATTR_IDMAP): open a userns fd with the desired UID/GID map, call mount_setattr referencing it, and the kernel rewrites ownership on-the-fly during VFS lookups. "No data is rewritten on disk; the mapping is virtual."
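The allocation step and the CRI message shapes described above can be sketched in a few lines of Python. This is a simplified model, not the kubelet's Go code: the class name UsernsAllocator, the starting host ID of 65536, and the 1024-slot bitmap are assumptions made for the example, and persistence to disk is left out entirely. The IDMapping and UserNamespace field names follow the description in the list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

RANGE_LEN = 65536  # IDs handed to each opted-in pod


@dataclass(frozen=True)
class IDMapping:
    host_id: int
    container_id: int
    length: int


@dataclass(frozen=True)
class UserNamespace:
    mode: str  # "NODE" (shared with host) or "POD" (private range)
    uids: List[IDMapping]
    gids: List[IDMapping]


class UsernsAllocator:
    """Toy per-node allocator: one bitmap slot per 65536-ID range."""

    def __init__(self, first_host_id: int = 65536, slots: int = 1024) -> None:
        self.first_host_id = first_host_id  # assumed start of the pool
        self.used = [False] * slots         # the allocation bitmap
        self.by_pod: Dict[str, int] = {}    # pod UID -> slot index

    def allocate(self, pod_uid: str, host_users: Optional[bool]) -> UserNamespace:
        # nil/true keeps the pod in the host's user namespace.
        if host_users is None or host_users:
            return UserNamespace("NODE", [], [])
        if pod_uid not in self.by_pod:
            slot = self.used.index(False)   # next free slot; ValueError if full
            self.used[slot] = True
            self.by_pod[pod_uid] = slot
        start = self.first_host_id + self.by_pod[pod_uid] * RANGE_LEN
        mapping = IDMapping(host_id=start, container_id=0, length=RANGE_LEN)
        return UserNamespace("POD", [mapping], [mapping])

    def release(self, pod_uid: str) -> None:
        slot = self.by_pod.pop(pod_uid, None)
        if slot is not None:
            self.used[slot] = False


alloc = UsernsAllocator()
first = alloc.allocate("pod-a", host_users=False)
second = alloc.allocate("pod-b", host_users=False)
print(first.uids[0])   # range starting at host ID 65536
print(second.uids[0])  # range starting at host ID 131072
```

Because each pod gets a disjoint range, two pods that both run as UID 0 inside never share a host UID, which is what keeps them isolated from each other as well as from the host.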
The mapping that makes root harmless
    # UID mapping (length = 65536)
    container UID 0 (root inside the pod) → host UID 65536 (unprivileged)
    container UID 65535 → host UID 131071
    # host UIDs 0–65535 (real system) are unreachable from the pod
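The arithmetic behind that mapping is a simple range translation. The Python sketch below uses hypothetical helper names (to_host_id, to_container_id) and the example range from this section; it models the lookup only and is not kernel code.

```python
from typing import List, Optional, Tuple

# (container_id, host_id, length): the same column order as /proc/<pid>/uid_map
Mapping = Tuple[int, int, int]


def to_host_id(container_id: int, mappings: List[Mapping]) -> Optional[int]:
    """Translate an ID seen inside the pod to the ID the host sees."""
    for inside, outside, length in mappings:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None  # no mapping covers this ID


def to_container_id(host_id: int, mappings: List[Mapping]) -> Optional[int]:
    """Translate a host ID to the pod's view, or None if it is unreachable."""
    for inside, outside, length in mappings:
        if outside <= host_id < outside + length:
            return inside + (host_id - outside)
    return None


POD_MAP: List[Mapping] = [(0, 65536, 65536)]

print(to_host_id(0, POD_MAP))       # 65536
print(to_host_id(65535, POD_MAP))   # 131071
print(to_container_id(0, POD_MAP))  # None: host root has no identity in the pod
```

The last line is the security property in miniature: host UID 0 falls outside every range the pod was given, so no ID inside the pod corresponds to it.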
An escaped process therefore cannot read the host's /etc/shadow, bind-mount the host, or ptrace host processes.
What it takes to adopt
For practitioners, the actionable checklist from the talk:
- Kernel ≥ 6.3 for overlayfs idmap (5.12 minimum for plain files); containerd ≥ 2.0; runc ≥ 1.2 or crun ≥ 1.9.
- Set hostUsers: false in the pod spec. On v1.33+ the feature is on by default.
- Verify idmap is actually active: mount | grep overlay should show an ovl-idmapped… mount, not a recursive chown fallback.
- Watch the v1.34+ Prometheus metrics to confirm pods are getting per-pod user namespaces rather than silently falling back.
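A complementary check, separate from the mount inspection above, is to confirm the UID remapping itself from inside the container by reading /proc/self/uid_map, a standard Linux interface whose lines hold three numbers: ID inside the namespace, ID outside, and range length. The Python sketch below parses that format; the function names are hypothetical and the two sample strings are illustrative rather than captured from a real cluster.

```python
from typing import List, Tuple

Entry = Tuple[int, int, int]  # (inside_id, outside_id, length)


def parse_id_map(text: str) -> List[Entry]:
    """Parse the contents of /proc/<pid>/uid_map (or gid_map)."""
    entries: List[Entry] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, length = (int(f) for f in fields)
            entries.append((inside, outside, length))
    return entries


def root_is_remapped(entries: List[Entry]) -> bool:
    """True when ID 0 inside the namespace maps to a nonzero ID outside it."""
    for inside, outside, length in entries:
        if inside <= 0 < inside + length:
            return outside - inside != 0
    return False  # ID 0 is not mapped at all


# Illustrative samples: the initial (host) namespace maps every ID to itself,
# while a pod with its own user namespace shows a shifted range.
HOST_SAMPLE = "         0          0 4294967295\n"
POD_SAMPLE = "         0      65536      65536\n"

print(root_is_remapped(parse_id_map(HOST_SAMPLE)))  # False
print(root_is_remapped(parse_id_map(POD_SAMPLE)))   # True
```

Inside a running container, passing the contents of /proc/self/uid_map to parse_id_map gives the live answer; a result of False on a pod that set hostUsers: false would suggest the pod is still sharing the host's user namespace.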
FAQ
Does hostUsers: false break apps that need to run as root?
No — that's the point. Inside the pod, processes are still UID 0 with capabilities, so root-requiring apps work. The UID is just mapped to an unprivileged host UID, so the privilege is real inside the namespace and harmless outside it.
Why did idmap mounts matter so much?
The original chown approach duplicated and re-owned image files per pod — ~30s startup and big storage overhead. idmap remaps ownership virtually in the kernel (no on-disk rewrite), cutting startup to ~1s and storage overhead by ~97%. It turned user namespaces from "too expensive" into "default-on."
Is this the same as Pod Security Standards / running as non-root?
Complementary. runAsNonRoot avoids UID 0 in the container; user namespaces let you safely keep UID 0 by remapping it. PSS integration (v1.33) means the two work together — defense in depth, not either/or.
What if my kernel is too old?
Runtimes fall back to slow_chown (recursive chown) — functional but slow and storage-heavy. Verify with mount | grep overlay; if you don't see ovl-idmapped, you're on the fallback path and should upgrade the kernel to 6.3+.
Takeaways
- Container root = host root by default — that shared user namespace is why escapes are catastrophic.
- hostUsers: false remaps UID 0 to an unprivileged host UID, scoping capabilities to the pod's namespace.
- idmap mounts were the unlock: virtual ownership remapping cut startup ~97% and made the feature practical.
- Nine years, eleven releases — the hard part was kernel support, CRI changes across three runtimes, and storage UX, not the idea.
- GA in v1.36, enabled by default since v1.33. To adopt it: kernel 6.3+, containerd 2.0+, then verify with mount | grep ovl-idmapped.
Next in the series — Deep dive 08: Commit-Then-Disclose, which moves from preventing damage to responsibly handling the vulnerabilities that slip through.
References
- KubeCon Mumbai 2026 — Day 1 index · the rest of the series
- Kubernetes — User Namespaces · KEP-127 · the feature and its history
- mount_setattr(2) · the MOUNT_ATTR_IDMAP syscall
- Pod Security Standards · complementary hardening (PSS integration in v1.33)