// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · intermediate · 11 min read · v3.2

Delta Lake — ACID transactions on your data lake, no excuses.

data-architecture ai-native delta-lake lakehouse spark

TL;DR: Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata, time travel, and schema enforcement to data lakes. It stores data as Parquet files plus a JSON transaction log (_delta_log/). Born at Databricks, now a Linux Foundation project, it's the default table format on Databricks and widely used with Apache Spark. In the AI Native landscape it lives in Data › Data Architecture.

What it is

Delta Lake is an open table format that layers reliability on top of cloud object storage. Every table is a directory of Parquet files plus a _delta_log/ folder containing an ordered sequence of JSON commit files. Each commit records which files were added or removed; data files are never edited in place, so an update shows up as a remove plus an add. The result is a full audit trail and the ability to read the table as of any version whose files are still retained.

It was created by Databricks and open-sourced in 2019. Delta Lake 3.x is engine-agnostic via the delta-kernel libraries (Java, Rust), and UniForm can auto-generate Iceberg and Hudi metadata so other engines read Delta tables natively.

Why it exists

Data lakes promised cheap storage and schema-on-read flexibility but delivered unreliable pipelines. Failed jobs leave partial writes, concurrent writers corrupt tables, and there's no way to undo a bad ETL run. Delta Lake exists to close the gap between a raw data lake and a data warehouse — the "lakehouse" pattern — without giving up the openness of Parquet on object storage.

How it works

The transaction log is the single source of truth. Every write operation appends a new JSON file to _delta_log/:

_delta_log/
  000.json → 001.json → 002.json → ... → checkpoint.parquet

Parquet files
  data-0001.parquet ...

Readers replay log → file list

Fig 1 — Delta's transaction log tracks every file add/remove; readers replay the log to build the current table state.
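
The replay in Fig 1 can be sketched in a few lines of Python. This is a simplified illustration, not Delta's actual reader: LOG and live_files are hypothetical names, commits are held in memory rather than read from object storage, and only add and remove actions are modeled (real commits also carry metadata and protocol actions).

```python
import json

# Hypothetical in-memory stand-in for _delta_log/: version -> newline-delimited JSON actions.
LOG = {
    0: '{"add": {"path": "data-0001.parquet"}}\n{"add": {"path": "data-0002.parquet"}}',
    1: '{"remove": {"path": "data-0001.parquet"}}\n{"add": {"path": "data-0003.parquet"}}',
    2: '{"add": {"path": "data-0004.parquet"}}',
}


def live_files(log, as_of=None):
    """Replay commits in version order and return the set of live data files.

    Passing as_of stops the replay at that version, which is the idea behind time travel.
    """
    files = set()
    for version in sorted(log):
        if as_of is not None and version > as_of:
            break
        for line in log[version].splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


if __name__ == "__main__":
    print(sorted(live_files(LOG)))           # current table state
    print(sorted(live_files(LOG, as_of=0)))  # table as of version 0
```

Stopping the replay early is all a versioned read needs at the metadata level, which is why time travel is cheap as long as the older Parquet files have not been cleaned up.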

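By default a checkpoint (Parquet) is written every 10 commits (configurable through the delta.checkpointInterval table property), compacting the log so readers don't replay from version 0. Concurrent writers use optimistic concurrency control: each writer tries to commit the next version number, and if another writer gets there first, the loser checks the winning commit for conflicts with what it read. If there are none, it retries at the next version; otherwise the commit fails and the operation has to be rerun.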
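
A minimal sketch of the optimistic commit loop, assuming a local directory stands in for _delta_log/ and exclusive file creation stands in for the storage layer's put-if-absent. The names try_commit, commit_with_retry, and the conflicts_with callback are hypothetical, and the conflict check is left to the caller; Delta's real conflict detection is more involved.

```python
import json
import os
import tempfile


def commit_path(log_dir, version):
    # Delta names commit files with a zero-padded 20-digit version number.
    return os.path.join(log_dir, f"{version:020d}.json")


def try_commit(log_dir, version, actions):
    """Create the commit file for `version`; return False if it already exists."""
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(commit_path(log_dir, version), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))
    return True


def commit_with_retry(log_dir, read_version, actions, conflicts_with, max_attempts=5):
    """Try to commit after `read_version`, moving to the next version on a lost race.

    conflicts_with(winning_actions) -> bool is a caller-supplied (hypothetical) check.
    """
    version = read_version + 1
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        # Lost the race: inspect what the winner committed before retrying.
        with open(commit_path(log_dir, version)) as f:
            winner = [json.loads(line) for line in f.read().splitlines()]
        if conflicts_with(winner):
            raise RuntimeError(f"conflict with committed version {version}")
        version += 1
    raise RuntimeError("gave up after too many attempts")


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    try_commit(log_dir, 0, [{"add": {"path": "a.parquet"}}])
    # A second writer that also read the empty table lands on version 1.
    print(commit_with_retry(log_dir, -1, [{"add": {"path": "b.parquet"}}], lambda w: False))
```

The key design point is that the version number itself is the lock: whoever creates the file first owns that version, so no separate coordination service is needed when the storage layer offers atomic create-if-absent.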

Key capabilities

  • ACID transactions: serializable writes via the ordered log; no partial reads or corrupt tables.
  • Time travel: read any retained historical version with SELECT * FROM t VERSION AS OF 42 or TIMESTAMP AS OF '2026-05-01'.
  • Schema enforcement & evolution: reject writes that don't match the schema, or safely merge new columns.
  • MERGE / UPSERT / DELETE: row-level operations on Parquet-backed tables, including CDC with Change Data Feed.
  • Z-ordering & liquid clustering: colocate related data for faster queries; liquid clustering replaces hand-tuned partitioning and ZORDER BY with clustering keys that can be changed without rewriting existing data.
  • UniForm: auto-generates Iceberg/Hudi metadata so non-Spark engines read Delta tables without conversion.
  • Deletion vectors: mark rows deleted without rewriting files, so UPDATE/DELETE/MERGE avoid rewriting whole Parquet files; readers filter out the marked rows until a later rewrite removes them physically.
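
To make the deletion-vector idea from the list above concrete, here is a toy model. It is a sketch under loud assumptions: DataFile, delete_where, scan, and compact are hypothetical names, and a Python set of row positions stands in for the compressed bitmap a real deletion vector uses.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    rows: list                                 # treated as immutable, like a written Parquet file
    deleted: set = field(default_factory=set)  # row positions; stand-in for a deletion vector


def delete_where(data_file, predicate):
    """Soft-delete matching rows by recording their positions; rows are not rewritten."""
    for pos, row in enumerate(data_file.rows):
        if predicate(row):
            data_file.deleted.add(pos)


def scan(data_file):
    """Read the file, skipping positions marked in the deletion vector."""
    return [row for pos, row in enumerate(data_file.rows) if pos not in data_file.deleted]


def compact(data_file):
    """Write a new file containing only the surviving rows."""
    return DataFile(rows=scan(data_file))


if __name__ == "__main__":
    f = DataFile(rows=[{"id": 1}, {"id": 2}, {"id": 3}])
    delete_where(f, lambda r: r["id"] == 2)
    print(len(f.rows), scan(f))  # the file still holds 3 rows; readers see 2
```

The trade-off this models: deletes become cheap metadata writes, while reads pay a small filtering cost until a compaction rewrites the file.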

Quick start

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .appName("delta-demo") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()

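# write (overwrite so the demo can be re-run)
df = spark.range(100).toDF("id")
df.write.format("delta").mode("overwrite").save("/tmp/delta-table")

# time travel: read the table as of its first commit
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table").show()

# merge: upsert a batch of rows keyed on id
from delta.tables import DeltaTable
updates = spark.range(90, 110).toDF("id")  # example batch: ids 90-99 match, 100-109 are new
dt = DeltaTable.forPath(spark, "/tmp/delta-table")
dt.alias("t").merge(updates.alias("u"), "t.id = u.id") \
  .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()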

Why it matters for AI

Feature stores, training datasets, and evaluation sets all benefit from versioning and ACID semantics. Delta's Change Data Feed lets ML pipelines detect exactly which rows changed since the last training run — enabling incremental retraining. Time travel pins reproducible datasets to experiment runs, and schema enforcement catches upstream data drift before it corrupts model inputs.
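
The two reads this paragraph relies on can be sketched as small helpers. The function names and the last_trained_version bookkeeping are assumptions for illustration; the reader options (readChangeFeed, startingVersion, versionAsOf) are the ones Delta's Spark reader accepts, and the change feed only works on tables created or altered with delta.enableChangeDataFeed = true. The helpers take an existing SparkSession, such as the one built in the quick start.

```python
def read_changes_since(spark, table_path, last_trained_version):
    """Return a DataFrame of row-level changes committed after last_trained_version.

    `spark` is an existing SparkSession with Delta configured (assumed, not created here).
    """
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", last_trained_version + 1)
        .load(table_path)
    )


def read_pinned_snapshot(spark, table_path, version):
    """Return the table exactly as it was at `version`, for reproducible training runs."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(table_path)
    )
```

The change feed adds _change_type, _commit_version, and _commit_timestamp columns, so an incremental retraining job can, for example, keep only rows whose _change_type is insert or update_postimage. Logging the version passed to read_pinned_snapshot alongside each experiment run is what makes the dataset reproducible later.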

When to use, when to skip

Use it when your stack is Spark/Databricks-centric, you need MERGE/UPSERT workloads, or you want the richest Spark integration (Unity Catalog, liquid clustering, deletion vectors). UniForm makes it friendlier to multi-engine setups than before.

Skip it if your primary engines are Trino, Flink, or DuckDB and you don't use Spark — Iceberg has broader native engine support. Also skip for tiny datasets that don't need transactional guarantees.

Heads up: Delta Lake OSS and Databricks Delta do not have feature parity. Liquid clustering, UniForm, and some optimizations require the Databricks runtime or Delta 3.x+. Check which features are open-source vs. proprietary before you depend on them.

vs the alternatives

Format         | Best for                            | Trade-off
Delta Lake     | Spark/Databricks shops, MERGE-heavy | Historically Spark-coupled; catching up on openness
Apache Iceberg | Multi-engine lakehouse              | MERGE is available only where the engine implements it
Apache Hudi    | Upsert/CDC ingestion                | Higher complexity
lakeFS         | Git-like data branching             | Versioning layer, not a table format

Verified against the Delta Lake docs (delta.io), May 2026. Targets v3.2+.

© cvam — written in plaintext, served warm