TL;DR — Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open table format built for upsert-heavy and CDC workloads on data lakes. It manages data as Parquet/ORC files with a timeline-based metadata layer that enables record-level inserts, updates, and deletes — plus incremental query for efficient downstream consumption. In the AI Native landscape it lives in Data › Data Architecture.
What it is
Hudi started at Uber in 2016 to solve the "how do I upsert into a data lake" problem. When you have billions of ride records and need to update a rider's info retroactively, rewriting entire partitions is impractical. Hudi tracks records by key, knows which files contain which keys, and can merge updates into the right files efficiently.
It graduated as an Apache top-level project and supports Spark, Flink, Presto, Trino, Hive, and StarRocks as read/write engines.
Why it exists
Iceberg and Delta Lake solve the "reliable data lake" problem broadly. Hudi's differentiator is its record-level index and first-class support for:
- Upserts — update existing records by key without scanning the entire table.
- Incremental queries — ask "what changed since my last read?" and get only the delta, perfect for streaming ETL and ML feature pipelines.
- CDC ingestion — consume database change streams (Debezium, etc.) and apply them efficiently.
How it works
Hudi has two table types that trade off write and read performance:
Fig 1 — CoW rewrites files for fast reads; MoR logs deltas for fast writes. The timeline tracks all operations.
- Copy on Write (CoW) — every upsert rewrites the affected Parquet file in full. Reads are fast (just scan Parquet), writes are heavier.
- Merge on Read (MoR) — upserts go into Avro log files alongside the base Parquet. Reads merge base + logs at query time. Periodic compaction merges logs into new base files.
The timeline is Hudi's transaction log — an ordered sequence of instants (commits, compactions, cleans) stored in .hoodie/. It's what enables time travel, incremental queries, and rollbacks.
Key capabilities
- Record-level upsert/delete — primary-key-aware writes that touch only affected files.
- Incremental queries — read only what changed since a given commit instant — no full-table scan.
- Multi-modal indexing — Bloom filters, simple indexes, record-level indexes, and bucket indexes to locate records fast.
- ACID transactions — serializable via timeline; failed writes are rolled back automatically.
- Time travel — query any committed point on the timeline.
- Concurrency control — optimistic concurrency with lock providers (ZooKeeper, DynamoDB, Hive Metastore).
- Table services — built-in compaction, clustering (data layout optimization), cleaning (old file removal), and indexing — can run inline or async.
Quick start
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.getOrCreate()
# write (upsert)
df = spark.createDataFrame([(1, "alice", 100), (2, "bob", 200)], ["id", "name", "score"])
df.write.format("hudi") \
.option("hoodie.table.name", "scores") \
.option("hoodie.datasource.write.recordkey.field", "id") \
.option("hoodie.datasource.write.precombine.field", "score") \
.mode("overwrite").save("/tmp/hudi-scores")
# upsert — update bob's score
update = spark.createDataFrame([(2, "bob", 350)], ["id", "name", "score"])
update.write.format("hudi") \
.option("hoodie.table.name", "scores") \
.option("hoodie.datasource.write.recordkey.field", "id") \
.option("hoodie.datasource.write.precombine.field", "score") \
.option("hoodie.datasource.write.operation", "upsert") \
.mode("append").save("/tmp/hudi-scores")
# incremental query — what changed?
spark.read.format("hudi") \
.option("hoodie.datasource.query.type", "incremental") \
.option("hoodie.datasource.read.begin.instanttime", "0") \
.load("/tmp/hudi-scores").show()
Why it matters for AI
ML feature pipelines are fundamentally incremental — you don't want to recompute every feature from scratch on each training run. Hudi's incremental query tells your pipeline "here are the 50K rows that changed since yesterday" so feature computation is proportional to change volume, not total data volume. CDC ingestion from production databases into Hudi tables is the standard pattern for keeping feature stores fresh.
When to use, when to skip
Use it when your workload is upsert/CDC-heavy — streaming database replicas into a lake, maintaining slowly changing dimensions, or building incremental ML pipelines. Hudi's record-level index makes it the fastest format for key-based lookups and updates.
Skip it for append-only analytics where you never update or delete records. Iceberg or Delta Lake are simpler for pure analytics. Hudi's operational complexity (compaction, indexing, table services) is overhead you don't need if updates are rare.
vs the alternatives
| Format | Best for | Trade-off |
|---|---|---|
| Apache Hudi | Upsert/CDC, incremental pipelines | Higher operational complexity |
| Apache Iceberg | Multi-engine analytics | No built-in record index |
| Delta Lake | Spark/Databricks ecosystem | Historically Spark-coupled |
| lakeFS | Git-like data branching | Versioning layer, not a table format |
References
- Official site — docs, community, releases.
- apache/hudi — source, Spark/Flink bundles.
- Documentation — concepts, table types, configuration.
- Quick start guide — first table in 5 minutes.
Extra reads
- Uber's Hudi story — why they built it.
- Record-level index deep dive — how Hudi finds records without scanning.
Verified against the Apache Hudi docs (hudi.apache.org), May 2026. Targets v1.0+.