// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · intermediate · 12 min read · v2

Apache Iceberg — the table format that turned a data lake into a warehouse.

TL;DR — Apache Iceberg is an open table format for huge analytic datasets. It sits between your storage (S3, HDFS, GCS) and your compute engines (Spark, Trino, Flink, Dremio) and gives you ACID transactions, time travel, schema evolution, and partition evolution — things a pile of Parquet files never had. In the AI Native landscape it lives in Data › Data Architecture: the layer that makes your training and feature data reliable.

What it is

Iceberg is not a storage engine and not a query engine. It's a table format specification: a set of metadata files and rules that describe which data files belong to a table, how they're organized, and what the schema looks like. You keep writing Parquet (or ORC or Avro) files to object storage; Iceberg adds a metadata tree on top that gives you warehouse semantics.

Originally built at Netflix to fix the problems of Hive tables at petabyte scale, it graduated to a top-level Apache project and has become a de facto standard among open lakehouse formats. It's engine-agnostic: Spark, Trino, Flink, StarRocks, DuckDB, Snowflake, and BigQuery all read Iceberg natively.

Why it exists

Traditional data lakes are just directories of files. That means:

  • No transactions — a writer crashes halfway and you get partial data.
  • No schema enforcement — a column rename breaks every downstream job.
  • Partition changes require full rewrites — want to change from daily to hourly partitioning? Rewrite the entire table.
  • No time travel — once you overwrite, the old data is gone.
  • File listing is O(n) — planning a query on a million-file table is slow because the engine has to list every directory.

Iceberg solves all of these with a metadata layer that tracks every file, every snapshot, and every schema version — without locking you into a single engine.

How it works

Iceberg tables have a three-level metadata tree:

metadata pointer
  └─ metadata file (schema + snapshots)
       ├─ manifest list
       │    ├─ manifest
       │    └─ manifest
       └─ manifest list
            ├─ manifest
            └─ manifest

Fig 1 — Iceberg's metadata tree: pointer → metadata file → manifest lists → manifests → data files.

  1. Metadata file: a JSON file recording the current schema, partition spec, snapshot history, and sort order.
  2. Manifest list: an Avro file, one per snapshot; lists which manifest files belong to that snapshot, with partition-level stats for quick pruning.
  3. Manifest file: an Avro file listing individual data files (Parquet/ORC/Avro) with column-level min/max stats, null counts, and file sizes.
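
To make the pruning role of those statistics concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's reader: `DataFileEntry`, `can_contain`, and `plan_scan` are hypothetical names, and real manifests are Avro files with bounds keyed by field ID. The idea it illustrates is that a file whose min/max range for a column cannot overlap the predicate is skipped without ever being opened.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """Toy stand-in for one manifest entry (hypothetical, not Iceberg's API)."""
    path: str
    lower: dict  # column name -> min value stored in the file
    upper: dict  # column name -> max value stored in the file


def can_contain(entry, column, lo, hi):
    """True if the file's [min, max] range overlaps the query range [lo, hi]."""
    if column not in entry.lower or column not in entry.upper:
        return True  # no stats recorded, so the file cannot be ruled out
    return entry.lower[column] <= hi and entry.upper[column] >= lo


def plan_scan(manifest, column, lo, hi):
    """Return only the paths whose stats say they may hold matching rows."""
    return [e.path for e in manifest if can_contain(e, column, lo, hi)]


manifest = [
    DataFileEntry("data/a.parquet", {"id": 1}, {"id": 100}),
    DataFileEntry("data/b.parquet", {"id": 101}, {"id": 200}),
    DataFileEntry("data/c.parquet", {}, {}),  # written without column stats
]

# Equivalent of: WHERE id BETWEEN 150 AND 160
files_to_read = plan_scan(manifest, "id", 150, 160)
print(files_to_read)
```

Only `b.parquet` (overlapping range) and `c.parquet` (no stats, so it must be kept) survive planning; `a.parquet` is dropped from the scan using metadata alone.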

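A commit is an atomic pointer swap: write new data files, write new manifests, write a new manifest list, write a new metadata file, then swap the pointer to it. Metadata files are immutable, so readers always see a consistent snapshot. Failed writes leave orphan files behind; these are not cleaned up automatically and need a periodic maintenance job, such as the remove_orphan_files procedure in Spark.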
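
The commit protocol can be sketched as a compare-and-swap loop. This is a toy in-memory model under loud assumptions: `ToyCatalog` and `commit` are hypothetical names, the lock stands in for whatever atomic operation a real catalog provides, and metadata "files" are just strings.

```python
import threading


class ToyCatalog:
    """Hypothetical in-memory catalog holding one table's metadata pointer."""

    def __init__(self, initial_metadata):
        self._pointer = initial_metadata
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap

    def current(self):
        return self._pointer

    def compare_and_swap(self, expected, new):
        """Swap the pointer only if nobody committed since `expected` was read."""
        with self._lock:
            if self._pointer != expected:
                return False
            self._pointer = new
            return True


def commit(catalog, writer, max_retries=3):
    """Optimistic commit: read the base, prepare new metadata, swap, retry on conflict."""
    for _ in range(max_retries):
        base = catalog.current()
        # A real writer would write data files, manifests, a manifest list,
        # and a new metadata file here, before attempting the swap.
        new_metadata = f"{base}+{writer}"
        if catalog.compare_and_swap(base, new_metadata):
            return new_metadata
    raise RuntimeError(f"{writer}: gave up after {max_retries} conflicting commits")


catalog = ToyCatalog("v0")
stale_base = catalog.current()   # writer B reads the pointer...
commit(catalog, "A")             # ...but writer A commits first
b_first_try = catalog.compare_and_swap(stale_base, "v0+B")  # rejected, base is stale
b_final = commit(catalog, "B")   # B re-reads the pointer and commits on top of A
print(b_first_try, catalog.current())
```

Writer B's first attempt is rejected because the pointer moved underneath it; the retry rebuilds its metadata on top of A's commit, so neither writer's work is silently lost.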

Key capabilities

  • ACID transactions — serializable isolation via optimistic concurrency on the metadata pointer.
  • Time travel — every commit is a snapshot; query any historical version by snapshot ID or timestamp.
  • Schema evolution — add, drop, rename, or reorder columns without rewriting data; tracked by field IDs, not position.
  • Partition evolution — change partitioning strategy (daily → hourly, add a new partition field) without rewriting existing data.
  • Hidden partitioning — partition transforms (year, month, day, hour, bucket, truncate) are in metadata; users write SQL without knowing partition layout.
  • File-level stats — column min/max, null counts, and row counts in manifests enable aggressive scan pruning.
  • Engine-agnostic — one table, many readers and writers; no engine lock-in.
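
Schema evolution by field ID is easier to see in code than in prose. The sketch below is a toy model in plain Python, not Iceberg's reader: the schemas are plain dicts, `read_rows` is a hypothetical helper, and the row layout is invented for illustration. It shows why a rename, a drop, and an add need no data rewrite when columns are resolved by ID.

```python
# Table schemas map field ID -> current column name. IDs are never reused.
schema_v1 = {1: "id", 2: "ts", 3: "payload"}
# v2: rename payload -> body, drop ts, add source with a fresh ID.
schema_v2 = {1: "id", 3: "body", 4: "source"}

# A data file written under schema_v1 stores its values keyed by field ID.
old_file_rows = [
    {1: 1, 2: "2024-01-01T00:00:00", 3: "hello iceberg"},
]


def read_rows(rows, schema):
    """Project stored rows through a schema by field ID, not by name or position."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in rows
    ]


rows_v2 = read_rows(old_file_rows, schema_v2)
print(rows_v2)
```

The old file is untouched: the renamed column still resolves through ID 3, the dropped column is simply not projected, and the new column reads as null because no old file contains ID 4.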

Quick start

Create an Iceberg table with PySpark and write to it:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse") \
    .getOrCreate()

spark.sql("CREATE TABLE local.db.events (id BIGINT, ts TIMESTAMP, payload STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'hello iceberg')")

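# time travel: snapshot IDs are large generated longs, not 1, 2, 3, so look one up first
snapshot_id = spark.sql("SELECT snapshot_id FROM local.db.events.snapshots ORDER BY committed_at").first()[0]
spark.sql(f"SELECT * FROM local.db.events VERSION AS OF {snapshot_id}").show()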

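For production, swap the Hadoop catalog for a REST catalog (for example Nessie, Polaris, or Unity Catalog) and point the warehouse at S3 or GCS. The Hadoop catalog relies on atomic file renames, which object stores generally do not provide, so it is best kept for local experiments.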
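
As a rough sketch of what that swap looks like, the helper below builds the Spark settings for a REST catalog. The property keys (`type`, `uri`, `warehouse`, `io-impl`) are standard Iceberg catalog properties, but `rest_catalog_conf` is a hypothetical helper, the endpoint and bucket are placeholders, and some REST catalogs expect a warehouse name rather than a path, so check your catalog's documentation.

```python
def rest_catalog_conf(name, uri, warehouse):
    """Build Spark conf entries for an Iceberg REST catalog called `name`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        # S3FileIO needs the AWS SDK on the classpath,
        # for example via the iceberg-aws-bundle artifact.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


# Placeholder endpoint and bucket: replace with your own.
conf = rest_catalog_conf(
    name="prod",
    uri="https://catalog.example.com/api/catalog",
    warehouse="s3://example-bucket/warehouse",
)

for key, value in conf.items():
    print(f"{key}={value}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` in place of the `local` settings from the quick start; tables are then addressed as `prod.db.events`.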

Why it matters for AI

ML pipelines need reproducible datasets. Iceberg's time travel and snapshot isolation mean you can pin a training run to an exact dataset version, audit what changed between retrains, and roll back a bad feature-engineering commit without rewriting terabytes. Partition evolution also lets you restructure feature tables as requirements shift — no migration downtime.
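
One way to pin a training run is to resolve a cutoff time to a snapshot ID once and store that ID with the run. The sketch below is plain Python under stated assumptions: `Snapshot`, `snapshot_as_of`, and `pinned_query` are hypothetical names, and the snapshot IDs and timestamps are made up. In a real pipeline the history would come from the table's `snapshots` metadata table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int  # commit time in epoch milliseconds


def snapshot_as_of(history, ts_ms) -> Optional[Snapshot]:
    """Latest snapshot committed at or before ts_ms, or None if there is none."""
    eligible = [s for s in history if s.committed_at_ms <= ts_ms]
    return max(eligible, key=lambda s: s.committed_at_ms) if eligible else None


def pinned_query(table, snapshot):
    """Spark SQL that reads exactly the pinned version, whatever is committed later."""
    return f"SELECT * FROM {table} VERSION AS OF {snapshot.snapshot_id}"


# Made-up snapshot IDs and timestamps for a hypothetical feature table.
history = [
    Snapshot(snapshot_id=5001, committed_at_ms=1_700_000_000_000),
    Snapshot(snapshot_id=5002, committed_at_ms=1_700_000_600_000),
    Snapshot(snapshot_id=5003, committed_at_ms=1_700_001_200_000),
]

training_cutoff_ms = 1_700_000_900_000  # falls between the 2nd and 3rd commits
pinned = snapshot_as_of(history, training_cutoff_ms)
run_record = {
    "model": "example-model",
    "table": "local.db.events",
    "snapshot_id": pinned.snapshot_id,
}
query = pinned_query(run_record["table"], pinned)
print(query)
```

Storing the snapshot ID rather than the timestamp keeps the run reproducible even if later commits land with earlier-looking data. The pin only holds while the snapshot exists, so snapshot expiry needs to retain any version a model still depends on.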

When to use, when to skip

Use it when you have analytic or ML data on object storage and need transactions, time travel, schema evolution, or multi-engine access. It has become a common default for new lakehouse architectures.

Skip it for small datasets that fit in a single Parquet file, real-time OLTP workloads (use a database), or if you're already all-in on Delta Lake and don't need engine portability.

Heads up: Iceberg needs a catalog (REST, Hive, Glue, Nessie) to track table locations and handle concurrent commits. Without one, you're doing manual file management, which defeats the purpose.

vs the alternatives

Format         | Best for                                | Trade-off
Apache Iceberg | Multi-engine lakehouse, open ecosystem  | Needs a catalog; no built-in compute
Delta Lake     | Spark-first shops, Databricks ecosystem | Historically Spark-coupled; UniForm bridges the gap
Apache Hudi    | Upsert-heavy / CDC ingestion            | More complex; narrower engine support
lakeFS         | Git-like branching over any data        | Versioning layer, not a table format

Verified against the Apache Iceberg docs and spec, May 2026. Targets v1.7+ (format v2).

© cvam — written in plaintext, served warm