Re-Architecting Monoliths into Kubernetes Microservices at Million-User Scale — KubeCon Mumbai 2026 Deep Dive 01

Deep dive 1 of the KubeCon Mumbai 2026 series. Aditya Sharma (Technical Architect, DevOps @ Lumenore) walked through a brutally honest 10+ year evolution of one platform — a BI & AI analytics product — from four bare-metal servers to a portable, multi-cloud, GitOps-driven Kubernetes platform serving millions of users across cloud and on-prem. The talk's gift isn't the destination; it's the order the lessons arrived in, and the outages that forced each step. This post expands every chapter of that journey into a migration playbook you can actually follow.

Most "monolith to microservices" talks are aspirational — a clean architecture diagram and a promise. This one was a war diary. Sharma's framing was a multi-year saga of "what broke, what hurt, and what we built so it never happened again." That structure is exactly why it's worth a deep dive: each architecture wasn't chosen on a whiteboard, it was extracted from a failure. Below, I walk the timeline year by year, explain the why behind every move, and pull out the reusable principle at each stage.

Who's talking and why it matters. Lumenore is a Business Intelligence & AI data-analytics platform — meaning it carries both a transactional workload (users, dashboards, metadata) and a heavy analytical workload (crunching large datasets). That dual nature shows up in every architecture below as the recurring split between an OLTP database (MySQL/Postgres) and an OLAP engine (Vertica, Spark). Keep that split in mind — it's the load-bearing constraint of the whole story.

Year 0–2 — the monolith era: one big app to rule them all

The starting point was deliberately humble: four servers total — two application servers and two database servers — running CentOS on bare metal. One app server ran Java, the other ran a Polymer JS front end. The data tier was already split by job: MySQL for transactional data and Vertica for analytical queries. Code and data were pulled onto the application servers with TortoiseSVN — yes, SVN, directly onto production boxes.

Fig 1 — the humble beginning: four boxes, SVN deploys, no orchestration.

The first pain points arrived right on cue as the platform crossed 1,000 users: HTTP 5xx errors under load, CPU throttling and memory spikes, and (the killer) no rolling updates, so every bug fix needed downtime. The team reached for the obvious lever, vertical scaling (bigger boxes), and learned the first hard lesson: it helps for a moment, then the return on each bigger box drops below what it costs. Worse, both vertical and horizontal scaling on bare metal hit a hard ceiling set by a slow, expensive hardware-procurement approval process. Hardware upgrades during maintenance windows meant downtime, which meant frustrated customers.

Principle that survives the whole talk: availability and scalability must be designed in — they cannot be bolted on later. Every subsequent chapter is the team paying interest on having learned this the hard way. If you're building today, this is the cheapest lesson to borrow.

Year 3–4 — the great unbundling: enter Docker and microservices

The turning point was a 6–8 month proof of concept that put Docker on all the servers. The immediate win was operational: replica-based scaling replaced running the application directly on the VM. Instead of one fat process per box, the team ran many container replicas, and costs dropped significantly because the same workload now fit on less hardware.

But the deeper realization was about the code, not the infrastructure. The data revealed two things: the monolithic codebase had become unmanageable (thousands of lines, impossible to debug), and the original Java + Polymer JS stack couldn't deliver the roadmap. So they made the call that names the talk: decompose the monolith into services with well-defined interfaces and clear ownership. The single codebase was split into services built on a mix of languages, frameworks, and data stores (Java, Python, Node.js, Spark, MongoDB, Redis), each service free to use the right tool.

This is also where observability was born (first generation): Zipkin for distributed tracing, Docker stats for CPU/memory/network, and logs shipped to NFS-mounted storage. Shared state across containers used GlusterFS, a scalable network filesystem. The toolchain matured too: TortoiseSVN → GitLab → Jenkins, the first real CI/CD pipeline.

The honest sequencing here is the lesson. They containerized first (Docker, replicas, cost win), and only then decomposed the monolith — because the data from running containers is what proved the monolith was the bottleneck. Containerization gave them the observability to make the microservices decision with evidence rather than fashion. That's the right order: get signal, then re-architect.

Year 5 — Kubernetes arrives: power, promise, and a painful outage

With many containers across many languages, manual Docker management stopped scaling, and Kubernetes entered the chat. The first cluster was modest: 3 servers, one master and two workers. The master ran kube-apiserver, kube-scheduler, and kube-controller-manager; the workers ran kubelet, kube-proxy, and Docker. DNS resolved straight to the master node's IP. The full application stack now ran as pods (Java, Python, Spark for big data, Node.js for the UI, Redis for caching, MongoDB), while the heavy databases (MySQL for metadata, Vertica for analytics) stayed external to Kubernetes. Observability leveled up to v2: Prometheus (metrics), Grafana (dashboards), the ELK stack (logs), and Zipkin (traces).

Then the promise met reality. A single Kubernetes node failed, and the entire platform went down. The root cause was structural: a single control-plane node, with no redundancy designed in to survive its failure. To make it worse, monitoring alerts were being missed precisely when a pod or node went down, because the observability stack had the same single points of failure as the workload.

The outage that defined the rest of the journey. This is the most important slide in the deck. A brand-new Kubernetes cluster, single master, gave them worse availability than they expected — because moving to Kubernetes doesn't grant high availability, it only makes HA possible. A one-master cluster is a single point of failure with extra steps. The response: stop, and design a genuinely highly-available architecture for zero-downtime operations.

Year 6 — "never again": a crash course in high availability

The HA rebuild is a textbook control-plane topology, and it's worth memorizing because it's the reference design most teams should copy:

Fig 2 — the HA reference design: VIP + dual HAProxy, 3-master etcd quorum, redundant zone-aware workers.

DNS resolves to an HAProxy + Keepalived virtual IP, with two HAProxy instances so the load balancer itself isn't a single point of failure.
Three master nodes with an etcd quorum — the control plane can lose a node and keep making decisions (quorum needs a majority, so three tolerates one failure).
Three worker nodes with zone-aware scheduling for application redundancy.
Observability v3 went HA too — Prometheus, Grafana, and Alertmanager in highly-available mode, deployed with the kube-prometheus-stack Helm chart.
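
The quorum arithmetic behind "three tolerates one failure" is small enough to write down. The sketch below is only the majority rule, not anything taken from the talk; the function names are made up for illustration.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum_size(members)


# Print the trade-off for a few common control-plane sizes.
for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum_size(n)}, "
          f"tolerates={failure_tolerance(n)} failure(s)")
```

One consequence worth noticing: going from three members to four raises the quorum to three but still tolerates only one failure, which is why odd sizes such as three or five are the usual choice.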

The payoff was real: node failures became non-events, and Kubernetes version upgrades and OS maintenance became zero-downtime. But Sharma was candid about what this setup still couldn't do — and that honesty is what drives the next chapter:

Key limitations of the HA-but-regional setup: it was regional and stuck in one physical location (so a site-level disaster was still fatal); it was very costly to change CPU, RAM, storage type/speed, or vendor (bare metal locks you in); and it was hard to manage all that hardware and software as the number of environments, codebases, and users kept growing. High availability within one site is not business continuity.

Year 7–8 — run anywhere, break nowhere: a truly portable platform

The fix for "stuck in one location, locked to one vendor" was portability — and this is the chapter where the platform becomes genuinely cloud-native. The team moved to elastic managed services across Azure, Oracle, and AWS, with a clean DC-DR split: a primary data center (East US) replicating and syncing to a disaster-recovery region (West US).

The building blocks that made "run anywhere" true:

Layer	What they used
Packaging	Helm charts + Kustomize — one app definition, environment overlays.
Infrastructure as code	Terraform to provision and manage environments reproducibly.
Configuration management	Ansible for config and deployment automation.
Managed data	Fully managed MySQL (later moved to PostgreSQL) and Redis — automated backups, HA built in.
Secrets & storage	Cloud Storage account + Key Vault — encrypted data and credential storage with DR.
Pipeline & registry	Azure DevOps + Azure Repos for versioned code; images in Azure Container Registry (ACR).
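
To make "one app definition, environment overlays" concrete, here is a deliberately simplified analogy in Python: a base definition plus per-environment patches that state only the differences. It is not how Kustomize or Helm work internally (they use their own patching and templating rules), and every name and value below is hypothetical.

```python
from copy import deepcopy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a new dict: overlay values win, nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base definition shared by every environment.
BASE = {
    "image": "example/analytics-api",
    "replicas": 2,
    "resources": {"cpu": "500m", "memory": "512Mi"},
}

# Hypothetical overlays: each environment lists only what differs.
OVERLAYS = {
    "on-prem": {"replicas": 3},
    "cloud-prod": {"replicas": 6, "resources": {"memory": "2Gi"}},
}

rendered = {env: apply_overlay(BASE, patch) for env, patch in OVERLAYS.items()}
```

The point of the pattern is that the base never changes per environment; a new target (another cloud, another customer site) is one more small overlay rather than a forked copy of the whole definition.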

The headline capability was Cilium Cluster Mesh: connecting multiple clusters across regions and clouds — AKS (Azure) and EKS (AWS), East and West — into one logical fabric for cost-effective resource and load distribution. With Helm + Kustomize + Terraform + Ansible, Lumenore could now be deployed into any cloud (Oracle, AWS, Azure, Google) or on-premises.

The on-prem story got a deliberate upgrade too, and it's a neat preview of several other Day-1 talks: containerd instead of Docker as the runtime, Cilium for networking, MetalLB for bare-metal load balancing, Ingress-Nginx (later migrated to the Gateway API), and Rook Ceph for storage. The platform runs on any OS where Kubernetes runs — RHEL, AlmaLinux, Ubuntu, Amazon Linux.

This is the modern cloud-native toolbelt in one slide. Notice how many pieces map to their own KubeCon sessions: Cilium (networking/observability), Rook Ceph (deep dive 16), the Gateway API migration. The lesson: portability isn't one tool, it's a discipline — IaC + declarative packaging + a CNI that spans clusters — that together make the underlying cloud a swappable detail.

Year 9–10 — security gets real: from assumed to actually engineered

Up to here, security was largely assumed. This chapter is about making it engineered, across four layers: code, container, cluster, and cloud. The deck organized it as a three-column remediation program (code and container share a column), and it's one of the most complete defense-in-depth checklists you'll see on a conference slide:

Layer	Tools & controls
Code & container	SonarQube + OWASP Dependency-Check (SAST + vulnerable deps in CI), Trivy (CVE/secret/license scanning of images, repos, even VMs), Docker multi-stage builds (smaller images, less attack surface), Syft SBOM + Cosign image signing (provenance + integrity), KubeLinter & Hadolint (manifest/Dockerfile linting).
Cluster	Pod Security Admission (restricted profile across namespaces), Kubernetes Network Policies (deny-by-default, least-privilege pod-to-pod), Falco + Kyverno + Kubescape + Trivy Operator (runtime threat detection, policy enforcement, continuous scanning), Cilium Hubble UI (network-flow visibility, dropped-packet debugging).
Cloud	Network Security Groups (explicit allow-lists), IAM (least-privilege, MFA, RBAC, KMS, encryption in transit/at rest, secret rotation), encrypted off-site backups with RTO/RPO recovery plans, OS/node auto-patching with vulnerability SLAs, audit logging into Wazuh SIEM, and WAF + firewall + DDoS protection at the edge.
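
The "deny-by-default, least-privilege pod-to-pod" control in the cluster row boils down to a simple rule: a flow passes only if something explicitly allows it. A minimal sketch of that logic follows. Real Kubernetes NetworkPolicies select pods by label and namespace rather than by name, so treat the rule shape, service names, and ports here as assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    """One explicit allowance: source app may reach destination app on a port."""
    source: str
    destination: str
    port: int


def is_allowed(rules: frozenset, source: str, destination: str, port: int) -> bool:
    """Deny by default: a flow passes only if an explicit rule matches it."""
    return AllowRule(source, destination, port) in rules


# Hypothetical allow-list; the service names and ports are made up.
RULES = frozenset({
    AllowRule("web-ui", "api", 8080),
    AllowRule("api", "redis", 6379),
})
```

With this allow-list, `web-ui` can call `api`, but it cannot reach `redis` directly, and an empty rule set blocks everything.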

The unifying idea is the CI/CD security gate: Commit → Build → Scan → Quality Gate → Deploy. Security checks aren't a separate audit; they're a wall the pipeline can't get past. This connects directly to several other Day-1 talks: Kyverno gets its own session (deep dive 06), as do least-privilege access (deep dive 07) and coordinated disclosure (deep dive 08).
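
As a rough illustration of the Quality Gate step, the sketch below blocks a deploy when any scan finding reaches a chosen severity. The talk did not show its gate logic, so the `Finding` shape, the severity scale, the threshold, and the sample identifiers are all assumptions rather than the output format of any real scanner.

```python
from dataclasses import dataclass

# Assumed severity scale, lowest to highest.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


@dataclass(frozen=True)
class Finding:
    """One scanner result, in a simplified hypothetical shape."""
    scanner: str
    identifier: str
    severity: str


def blocking_findings(findings, fail_at="HIGH"):
    """Return the findings at or above the `fail_at` severity."""
    threshold = SEVERITY_ORDER.index(fail_at)
    return [f for f in findings if SEVERITY_ORDER.index(f.severity) >= threshold]


def quality_gate(findings, fail_at="HIGH"):
    """Return True when the pipeline may proceed to the deploy stage."""
    return not blocking_findings(findings, fail_at)


# Hypothetical scan results collected during the Scan stage.
SCAN_RESULTS = [
    Finding("image-scan", "EXAMPLE-0001", "MEDIUM"),
    Finding("dependency-check", "EXAMPLE-0002", "CRITICAL"),
]
```

Here `quality_gate(SCAN_RESULTS)` is false because of the critical finding, so the pipeline would stop before Deploy. The design choice that matters is that the gate returns a hard pass or fail rather than a report someone may or may not read.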

Year 11–13 — the dream stack: resilient, observable, automated

The final chapter is the "good enough → truly production-grade" jump, built on three pillars. Each one, again, was a fix for a specific recurring pain.

1. Argo CD — GitOps as the single source of truth

The problem: engineers made direct changes via the Kubernetes dashboard, skipping Git; Jenkins pipelines later overwrote those changes silently; there was no single source of truth and the live cluster state and the repository drifted constantly. The fix: move all manifests (Deployments, Services, ConfigMaps, NetworkPolicies, HPAs) into Git, and let Argo CD continuously reconcile the live cluster against the declared state. Manual drift is detected and auto-corrected, rollbacks become fast, traceable, low-risk Git reverts, and you get a full audit trail of who changed what, when, and why.
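
The reconcile loop described above can be pictured as a diff between what Git declares and what the cluster is running, followed by a correction in Git's favor. This is a toy model with hypothetical resource names, not Argo CD's implementation; in Argo CD itself, automatic pruning and self-healing are opt-in settings of the automated sync policy.

```python
def diff_state(desired: dict, live: dict) -> dict:
    """Compare declared manifests with live objects, keyed by resource name."""
    shared = desired.keys() & live.keys()
    return {
        "create": sorted(desired.keys() - live.keys()),
        "update": sorted(name for name in shared if desired[name] != live[name]),
        "prune": sorted(live.keys() - desired.keys()),
    }


def reconcile(desired: dict, live: dict) -> dict:
    """Return the live state after correcting drift: the declared state wins."""
    changes = diff_state(desired, live)
    healed = dict(live)
    for name in changes["create"] + changes["update"]:
        healed[name] = desired[name]
    for name in changes["prune"]:
        del healed[name]
    return healed


# Hypothetical declared state (from Git) and drifted live state.
DESIRED = {
    "Deployment/api": {"replicas": 3, "image": "api:1.4"},
    "Service/api": {"port": 8080},
}
LIVE = {
    "Deployment/api": {"replicas": 5, "image": "api:1.4"},  # edited by hand
    "ConfigMap/debug": {"verbose": "true"},                 # never committed
}
```

In this example the hand-edited replica count is reverted, the missing Service is created, and the uncommitted ConfigMap is removed. A rollback is then nothing more than reverting the commit that changed `DESIRED`.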

2. Istio — the service mesh

The problem: doing mTLS in application code meant changing every service (impractical); transient failures cascaded across dependencies; and there was no clean way to reroute traffic or do A/B deployments. What Istio gave them, transparently to app code: strict mTLS between all services, automatic retries, circuit breaking (isolate a struggling service, shed load gracefully), traffic management (canary, weighted splits, real-time rerouting), a multi-cluster mesh, and zero-trust networking — identity and authorization on every call. Kiali provided the real-time service graph (traffic flows, error rates, latency per service).
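
For readers who haven't met circuit breaking before, the state machine is short. The sketch below shows the general pattern only: Istio applies it in the sidecar proxy through mesh configuration, not in application code, and the class name, thresholds, and timings here are assumptions chosen for illustration.

```python
class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        """Return True if a call may be attempted at time `now` (in seconds)."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (the half-open state).
        return now - self.opened_at >= self.reset_after

    def record(self, success, now):
        """Update the breaker with the outcome of a call made at time `now`."""
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now  # open, or re-open after a failed trial
```

Time is passed in explicitly so the behavior is easy to test. After three failures the breaker rejects calls until the cool-down has elapsed, which is what stops one struggling dependency from dragging its callers down with it.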

3. OpenTelemetry — unified observability

The problem: every service had its own logging and tracing setup — inconsistent and hard to correlate — with metrics, traces, and logs living in separate silos, and getting them meant modifying application code. The fix: OpenTelemetry auto-instrumentation across all services (zero code change per service), unifying all three signal types, vendor-neutral so the backend can change without touching app code. The OTel Collector fans out to Prometheus (metrics), Jaeger (traces), and Loki (logs), all visualized in Grafana — one place, better correlation, standardized across cloud and on-prem.

Fig 3 — OTel decouples instrumentation from backends: change Jaeger for Tempo and no app code moves.
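
The decoupling in Fig 3 can be sketched as a routing table from signal type to backend. This is a conceptual model, not the OTel Collector's configuration format or API; the record shape and function name are hypothetical, and only the backend names come from the talk.

```python
from collections import defaultdict

# Signal type -> backend, mirroring the routing described in the talk.
PIPELINES = {"metrics": "prometheus", "traces": "jaeger", "logs": "loki"}


def fan_out(records, pipelines=PIPELINES):
    """Group telemetry records by the backend their signal type routes to."""
    batches = defaultdict(list)
    for record in records:
        backend = pipelines.get(record["signal"])
        if backend is None:
            raise ValueError(f"no pipeline for signal {record['signal']!r}")
        batches[backend].append(record)
    return dict(batches)


# Hypothetical records as they might arrive from instrumented services.
RECORDS = [
    {"signal": "metrics", "service": "api", "name": "http_requests_total"},
    {"signal": "traces", "service": "api", "name": "GET /dashboards"},
    {"signal": "logs", "service": "ui", "name": "render failed"},
]

# Swapping the trace backend touches only the routing table.
WITH_TEMPO = {**PIPELINES, "traces": "tempo"}
```

Calling `fan_out(RECORDS, WITH_TEMPO)` sends the same trace record to `tempo` instead of `jaeger` without any change to `RECORDS`, which is the property the figure is pointing at: producers never name their backend.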

The migration playbook, distilled

Strip away the specific tools and the journey is a repeatable sequence. If you're staring at your own monolith, this is the order that worked:

Stage	Move	The trigger that justified it
1	Containerize the monolith (Docker, replicas)	Vertical scaling went cost-negative; needed cheaper scaling + real metrics.
2	Decompose into services by ownership	Codebase unmanageable; one stack couldn't deliver the roadmap.
3	Orchestrate with Kubernetes	Manual container management stopped scaling across many services.
4	Make the control plane HA (3 masters, VIP)	A single node failure took the whole platform down.
5	Make it portable (IaC + Helm + cluster mesh)	Regional lock-in and vendor lock-in were existential risks.
6	Engineer security across all four layers	Security was assumed, not verified — unacceptable at scale.
7	GitOps + service mesh + OTel	Drift, cascading failures, and siloed telemetry capped reliability.

The meta-lesson. Notice that every stage was pulled forward by a concrete failure, not pushed by a trend. That's the healthiest way to adopt complexity: let the pain name the next tool. The companion microservices wisdom — start simple, evolve under pressure — is exactly what this 13-year arc demonstrates in production.

FAQ

Should I containerize before or after splitting the monolith?

Per this talk, containerize first. Running the monolith in containers is low-risk, cuts cost immediately via replica scaling, and — crucially — gives you the observability (metrics, traces) to decide where the real service boundaries are. Splitting blind is how you build a distributed monolith.

Why keep the databases (Vertica, MySQL) outside Kubernetes?

Stateful, performance-sensitive analytical and transactional databases were run as external managed or dedicated services rather than as pods, especially in the earlier years. Keeping them external sidesteps the hardest parts of stateful Kubernetes while the platform team matures. By the portable-platform stage they'd moved to fully managed MySQL/PostgreSQL and Redis with automated backups and HA. (Stateful-on-K8s is exactly what the Rook deep dive tackles.)

Does moving to Kubernetes give me high availability?

No — and that's the talk's sharpest lesson. A single-master cluster is a single point of failure. Kubernetes makes HA possible (3-master etcd quorum, redundant zone-aware workers, an HA load balancer), but you have to design and pay for it. The Year-5 outage happened on Kubernetes.

What's the smallest version of this I should copy?

The Year-6 HA topology: a virtual IP fronted by two load balancers, three control-plane nodes with etcd quorum, redundant workers, and an HA monitoring stack via kube-prometheus-stack. That single design turns node failures into non-events and makes upgrades zero-downtime.

Takeaways

Availability and scalability are designed in, not bolted on. The whole 13-year arc is interest paid on learning this late.
Let failures choose your next tool. Each architecture was extracted from a specific outage or cost wall — not adopted because it was trendy.
Kubernetes ≠ HA. A single control plane is a SPOF; the 3-master + VIP topology is the design to copy.
Portability is a discipline, not a product. IaC + declarative packaging + a cluster-spanning CNI (Cilium mesh) make the cloud a swappable detail and unlock DC-DR.
Security and observability are cross-cutting, automated layers — a CI/CD quality gate and OTel auto-instrumentation, not per-service afterthoughts.
GitOps closes the loop. Argo CD reconciling Git → cluster kills drift and makes rollbacks trivial.

Next in the series — Deep dive 02: Shared-First Kubernetes Platforms, which picks up exactly where this leaves off: once you have a portable platform, how do you let many teams share it without the platform becoming a blast radius?

References

KubeCon Mumbai 2026 — Day 1 index · the rest of the series
Talk listing — KubeCon India 2026 · sched.com
Lumenore · the platform in the case study
Microservices · Fowler & Lewis — the canonical intro
Argo CD · Istio · OpenTelemetry · the dream-stack pillars
