TL;DR — Apache Iceberg is an open table format for huge analytic datasets. It sits between your storage (S3, HDFS, GCS) and your compute engines (Spark, Trino, Flink, Dremio) and gives you ACID transactions, time travel, schema evolution, and partition evolution — things a pile of Parquet files never had. In the AI Native landscape it lives in Data › Data Architecture: the layer that makes your training and feature data reliable.
What it is
Iceberg is not a storage engine and not a query engine. It's a table format specification — a set of metadata files and rules that describe what data files belong to a table, how they're organized, and what the schema looks like. You keep writing Parquet (or ORC or Avro) files to object storage; Iceberg adds a metadata tree on top that gives you warehouse semantics.
Originally built at Netflix to fix the problems of Hive tables at petabyte scale, it was donated to the Apache Software Foundation, graduated to a top-level project in 2020, and has become a widely adopted open lakehouse format. It's engine-agnostic: Spark, Trino, Flink, StarRocks, DuckDB, Snowflake, and BigQuery can all read Iceberg tables.
Why it exists
Traditional data lakes are just directories of files. That means:
- No transactions — a writer crashes halfway and you get partial data.
- No schema enforcement — a column rename breaks every downstream job.
- Partition changes require full rewrites — want to change from daily to hourly partitioning? Rewrite the entire table.
- No time travel — once you overwrite, the old data is gone.
- File listing is O(n) — planning a query on a million-file table is slow because the engine has to list every directory.
Iceberg solves all of these with a metadata layer that tracks every file, every snapshot, and every schema version — without locking you into a single engine.
How it works
Iceberg tables have a three-level metadata tree:
Fig 1 — Iceberg's metadata tree: pointer → metadata file → manifest lists → manifests → data files.
- Metadata file: a JSON file recording the current schema, partition spec, sort order, and snapshot history.
- Manifest list: one Avro file per snapshot; lists the manifest files that belong to that snapshot, with partition-level stats for quick pruning.
- Manifest file: an Avro file listing individual data files (Parquet/ORC/Avro) with column-level min/max stats, null counts, and file sizes.
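To make the role of those per-file statistics concrete, here is a toy sketch in plain Python. It is not the Iceberg API: the `DataFile` class, its field names, and `plan_scan` are hypothetical stand-ins for manifest entries and scan planning. The only point is that a planner can skip a file whenever the predicate's range cannot overlap the min/max recorded for that file.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for one manifest entry (not Iceberg's real schema)."""
    path: str
    record_count: int
    lower: dict  # column name -> minimum value stored in this file
    upper: dict  # column name -> maximum value stored in this file


def plan_scan(files, column, lo, hi):
    """Keep only files whose [min, max] range for `column` overlaps [lo, hi]."""
    selected = []
    for f in files:
        if f.upper[column] < lo or f.lower[column] > hi:
            continue  # no row in this file can match, so it is never opened
        selected.append(f)
    return selected


# Three data files with disjoint id ranges, as a manifest might describe them.
manifest = [
    DataFile("data/a.parquet", 1000, {"id": 1}, {"id": 1000}),
    DataFile("data/b.parquet", 1000, {"id": 1001}, {"id": 2000}),
    DataFile("data/c.parquet", 1000, {"id": 2001}, {"id": 3000}),
]

# Rough equivalent of: WHERE id BETWEEN 1500 AND 1600
to_read = plan_scan(manifest, "id", 1500, 1600)
print([f.path for f in to_read])
```

Real planners apply the same idea twice: first to whole manifests using the partition summaries in the manifest list, then to individual files using the column stats in each manifest.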
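A commit is an atomic pointer swap: write new data files, write new manifests, write a new manifest list, write a new metadata file, then atomically swap the catalog's pointer to it. Metadata files are never modified in place, so readers always see a consistent snapshot. Failed writes leave orphan files behind; these are invisible to readers and are cleaned up later by a maintenance job such as Spark's remove_orphan_files procedure.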
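The commit protocol can be sketched as a compare-and-swap on a single pointer. The following is a toy model in plain Python, not Iceberg code: `ToyCatalog`, `TableMetadata`, `CommitConflict`, and `append_files` are hypothetical names, and a lock stands in for whatever atomic operation a real catalog provides.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical, immutable stand-in for one version of the metadata file."""
    version: int
    data_files: tuple


class CommitConflict(Exception):
    """Raised when the pointer moved between reading and committing."""


class ToyCatalog:
    """Holds the single pointer to the current metadata (toy model only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = TableMetadata(version=0, data_files=())

    def load(self):
        return self._current

    def swap(self, expected, new):
        # Compare-and-swap: succeeds only if nobody committed since `expected` was read.
        with self._lock:
            if self._current is not expected:
                raise CommitConflict("table changed since it was read")
            self._current = new


def append_files(catalog, new_files, max_retries=3):
    """Optimistic commit: build new metadata off to the side, then try the swap."""
    for _ in range(max_retries):
        base = catalog.load()
        candidate = TableMetadata(base.version + 1, base.data_files + tuple(new_files))
        try:
            catalog.swap(base, candidate)
            return candidate
        except CommitConflict:
            continue  # another writer won the race; re-read and retry
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
reader_view = catalog.load()  # a reader pins the table as of version 0
append_files(catalog, ["data/a.parquet"])
append_files(catalog, ["data/b.parquet"])
print(catalog.load().version, catalog.load().data_files)
print(reader_view.data_files)  # the pinned view is unaffected by later commits
```

Because each metadata version is immutable and only the pointer changes, a reader that loaded an older version keeps a consistent view, and a writer holding a stale base fails the swap instead of overwriting someone else's commit.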
Key capabilities
- ACID transactions — serializable isolation via optimistic concurrency on the metadata pointer.
- Time travel — every commit is a snapshot; query any historical version by snapshot ID or timestamp.
- Schema evolution — add, drop, rename, or reorder columns without rewriting data; tracked by field IDs, not position.
- Partition evolution — change partitioning strategy (daily → hourly, add a new partition field) without rewriting existing data.
- Hidden partitioning — partition transforms (year, month, day, hour, bucket, truncate) are in metadata; users write SQL without knowing partition layout.
- File-level stats — column min/max, null counts, and row counts in manifests enable aggressive scan pruning.
- Engine-agnostic — one table, many readers and writers; no engine lock-in.
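Of the capabilities above, schema evolution by field ID is the least intuitive, so here is a toy illustration in plain Python. It is a sketch under assumptions, not the Iceberg API: `ToySchema` and the helper functions are hypothetical, and rows are modeled as dictionaries keyed by field ID to mimic how data files reference columns by ID instead of by name or position.

```python
from dataclasses import dataclass


@dataclass
class ToySchema:
    """Hypothetical schema model: field ID -> current column name."""
    fields: dict
    last_id: int  # highest ID ever assigned; never reused, even after a drop


def rename_column(schema, old, new):
    """Metadata-only rename: the field ID stays, only the name changes."""
    fields = {fid: (new if name == old else name) for fid, name in schema.fields.items()}
    return ToySchema(fields, schema.last_id)


def add_column(schema, name):
    """Metadata-only add: allocate a fresh ID that was never used before."""
    new_id = schema.last_id + 1
    return ToySchema({**schema.fields, new_id: name}, new_id)


def drop_column(schema, name):
    """Metadata-only drop: the ID disappears from the schema but is not recycled."""
    fields = {fid: n for fid, n in schema.fields.items() if n != name}
    return ToySchema(fields, schema.last_id)


def read(rows, schema):
    """Project stored rows through the current schema; unknown IDs read as None."""
    return [{name: row.get(fid) for fid, name in schema.fields.items()} for row in rows]


v1 = ToySchema({1: "id", 2: "ts", 3: "payload"}, last_id=3)

# A "data file" written under v1. It is never rewritten below.
old_file_rows = [{1: 7, 2: "2024-01-01T00:00:00", 3: "hello"}]

v2 = rename_column(v1, "payload", "body")
v3 = add_column(v2, "source")
v4 = add_column(drop_column(v3, "source"), "source")  # a new column, not the old one

print(read(old_file_rows, v3))
print(v4.fields)
```

The old file still resolves correctly after the rename because the lookup goes through ID 3, and the re-added `source` column gets a new ID, so stale values written under the dropped column cannot resurface.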
Quick start
Create an Iceberg table with PySpark and write to it:
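from pyspark.sql import SparkSession
# Local Hadoop catalog named "local": fine for experiments, not for production.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)
spark.sql("CREATE TABLE local.db.events (id BIGINT, ts TIMESTAMP, payload STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'hello iceberg')")
# Time travel: snapshot IDs are generated longs (not 1, 2, 3...), so look one up
# in the table's snapshots metadata table before querying it.
snapshot_id = spark.sql(
    "SELECT snapshot_id FROM local.db.events.snapshots ORDER BY committed_at"
).first()["snapshot_id"]
spark.sql(f"SELECT * FROM local.db.events VERSION AS OF {snapshot_id}").show()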
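For production, swap the Hadoop catalog for a catalog service that supports atomic commits across concurrent writers, typically one that speaks the Iceberg REST catalog protocol (for example Nessie, Polaris, or Unity Catalog), and point the warehouse at S3/GCS.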
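As a rough sketch of what that swap looks like, the helper below builds the Spark properties for a REST catalog. The helper name `rest_catalog_conf`, the catalog name `prod`, the endpoint URL, and the bucket are all hypothetical placeholders; authentication settings are omitted because they depend on the catalog you use, and what the `warehouse` value means varies between catalog implementations.

```python
def rest_catalog_conf(name, uri, warehouse):
    """Build Spark properties for an Iceberg REST catalog registered as `name`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


conf = rest_catalog_conf(
    name="prod",                                    # hypothetical catalog name
    uri="https://catalog.example.com/api/catalog",  # hypothetical endpoint
    warehouse="s3://example-bucket/warehouse",      # hypothetical bucket
)

for key, value in sorted(conf.items()):
    print(f"{key}={value}")

# Applying it needs pyspark, so it is left commented out here:
# builder = SparkSession.builder
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

Tables are then addressed as prod.db.events instead of local.db.events; the SQL in the quick start is otherwise unchanged.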
Why it matters for AI
ML pipelines need reproducible datasets. Iceberg's time travel and snapshot isolation mean you can pin a training run to an exact dataset version, audit what changed between retrains, and roll back a bad feature-engineering commit without rewriting terabytes. Partition evolution also lets you restructure feature tables as requirements shift — no migration downtime.
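One way to put this into practice is to store the snapshot ID alongside each training run. The sketch below only builds the record and the SQL strings; it does not talk to Spark. The function name, the run name, and the snapshot ID are hypothetical, and in a real pipeline the ID would come from the table's snapshots metadata table. The rollback statement uses Iceberg's Spark procedure rollback_to_snapshot, which assumes the Iceberg SQL extensions are enabled in the session.

```python
import json


def training_run_record(run_name, table, snapshot_id):
    """Record which table snapshot a training run read, so it can be replayed."""
    # "local.db.events" -> catalog "local", identifier "db.events"
    catalog, _, identifier = table.partition(".")
    return {
        "run": run_name,
        "table": table,
        "snapshot_id": snapshot_id,
        # Re-read exactly the data the run saw (Spark SQL time travel).
        "read_sql": f"SELECT * FROM {table} VERSION AS OF {snapshot_id}",
        # Restore the table to this snapshot if a later commit turns out to be bad.
        "rollback_sql": (
            f"CALL {catalog}.system.rollback_to_snapshot('{identifier}', {snapshot_id})"
        ),
    }


record = training_run_record(
    run_name="example-run",            # hypothetical run name
    table="local.db.events",
    snapshot_id=5781947118336215154,   # hypothetical snapshot ID
)
print(json.dumps(record, indent=2))
```

Saving this record with the model artifacts means a later audit can re-run read_sql against the same table and get the same rows, provided the snapshot has not been expired by table maintenance.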
When to use, when to skip
Use it when you have analytic or ML data on object storage and need transactions, time travel, schema evolution, or multi-engine access. It's a common default for new lakehouse architectures.
Skip it for small datasets that fit in a single Parquet file, for real-time OLTP workloads (use a database), or when you're already all-in on Delta Lake and don't need engine portability.
vs the alternatives
| Format | Best for | Trade-off |
|---|---|---|
| Apache Iceberg | Multi-engine lakehouse, open ecosystem | Needs a catalog; no built-in compute |
| Delta Lake | Spark-first shops, Databricks ecosystem | Historically Spark-coupled; UniForm bridges gap |
| Apache Hudi | Upsert-heavy / CDC ingestion | More complex; narrower engine support |
| lakeFS | Git-like branching over any data | Versioning layer, not a table format |
References
- Official site — spec, docs, community.
- apache/iceberg — source, Java/Python/Go libraries.
- Documentation — concepts, configuration, Spark/Flink integration.
- Table format spec — the definitive reference for metadata layout.
Extra reads
- Apache Iceberg 101 — Dremio's intro with worked examples.
- Tabular blog — deep dives from the Iceberg creators.
- Netflix talk — Iceberg at scale — origin story and design decisions.
Verified against the Apache Iceberg docs and spec, May 2026. Targets v1.7+ (format v2).