// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · intermediate · 10 min read · v1.x

lakeFS — git for your data lake.

data-architecture ai-native lakefs versioning data-ops

TL;DR — lakeFS is an open-source platform that gives your data lake Git-like operations — branch, commit, merge, revert, diff — on top of S3, GCS, or Azure Blob. It's not a table format like Iceberg or Delta; it's a versioning and CI/CD layer that works with any format, any engine, and any file type. In the AI Native landscape it lives in Data › Data Architecture.

What it is

lakeFS is a Git-like version control system for data stored on object storage. It exposes an S3-compatible API, so every tool that speaks S3 (Spark, Trino, DuckDB, Airflow, pandas) works unchanged — you just point at a lakeFS endpoint instead of raw S3 and prefix paths with a branch name.
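
As a rough illustration of "point at a lakeFS endpoint and prefix paths with a branch name", the sketch below builds the settings and the object URI a Spark job might use. The endpoint URL, credentials, and both helper names are hypothetical; the `fs.s3a.*` keys are standard Hadoop S3A properties.

```python
def lakefs_s3a_conf(endpoint: str, access_key: str, secret_key: str) -> dict[str, str]:
    """Hadoop S3A properties that route s3a:// traffic to a lakeFS server."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.path.style.access": "true",
    }


def lakefs_uri(repo: str, ref: str, path: str) -> str:
    """The repository sits in the bucket position; the branch is the first path segment."""
    return f"s3a://{repo}/{ref}/{path.lstrip('/')}"


# Placeholder endpoint and credentials, for illustration only.
conf = lakefs_s3a_conf("https://lakefs.example.com", "ACCESS-KEY-EXAMPLE", "SECRET-KEY-EXAMPLE")
print(lakefs_uri("repo", "experiment", "features/"))
```

Switching a job from one branch to another then means changing only the `ref` argument; the rest of the path stays the same.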

Under the hood, lakeFS keeps its mutable state (branch pointers and uncommitted changes) in a key-value store such as PostgreSQL or DynamoDB, and writes committed metadata (which objects belong to which commit) as immutable files on object storage. The actual data stays on your object storage throughout. Branching is zero-copy: creating a branch adds a metadata pointer and duplicates no data.
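
To make "a branch is a pointer" concrete, here is a toy in-memory model. It is not how lakeFS is implemented; the class and its methods are invented for illustration, and it only shows why creating a branch costs nothing in stored data.

```python
class ToyRepo:
    """Illustrative only: branches point at commits, commits map paths to object addresses."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}                 # stands in for the object store
        self.commits: dict[int, dict[str, str]] = {0: {}}   # commit id -> {path: address}
        self.branches: dict[str, int] = {"main": 0}         # branch name -> commit id

    def write_and_commit(self, branch: str, path: str, data: bytes) -> int:
        address = f"obj-{len(self.objects)}"
        self.objects[address] = data
        snapshot = dict(self.commits[self.branches[branch]])  # copies pointers, not data
        snapshot[path] = address
        commit_id = len(self.commits)
        self.commits[commit_id] = snapshot
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name: str, source: str) -> None:
        self.branches[name] = self.branches[source]  # a new pointer; no object is copied

    def read(self, branch: str, path: str) -> bytes:
        return self.objects[self.commits[self.branches[branch]][path]]


repo = ToyRepo()
repo.write_and_commit("main", "features/part-0.parquet", b"v1")
repo.create_branch("experiment", source="main")
objects_after_branch = len(repo.objects)  # unchanged by the branch operation
repo.write_and_commit("experiment", "features/part-0.parquet", b"v2")
print(objects_after_branch, repo.read("main", "features/part-0.parquet"))
```

Writing to `experiment` adds one new object and leaves what `main` points at untouched, which is the isolation property branches are used for.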

Why it exists

Data pipelines are code — but the data they produce has none of the safety nets code has. There's no "undo" for a bad ETL write, no isolated staging environment for testing a pipeline change, and no way to atomically promote a validated dataset to production. lakeFS exists to bring the DevOps workflow to data:

  • Branch to isolate experimental or in-progress data.
  • Commit to create immutable, reproducible snapshots.
  • Merge to promote validated data to the main branch atomically.
  • Revert to undo a bad write instantly.

How it works

[Figure: Spark / Trino (any S3 client) → lakeFS (S3-compatible API; branch · commit · merge; metadata in Postgres) → S3 / GCS / Azure (actual data files)]

Fig 1 — Apps talk to lakeFS via S3 API; lakeFS manages branches/commits in metadata, data stays on object storage.

The workflow mirrors Git:

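# create a branch for your ETL experiment
lakectl branch create lakefs://repo/experiment --source lakefs://repo/main

# run your pipeline against the branch (PySpark, with the S3A endpoint set to lakeFS)
spark.read.parquet("s3a://repo/experiment/features/")

# commit the pipeline's output on the branch
lakectl commit lakefs://repo/experiment -m "rebuild features"

# validate: list what experiment changed relative to main
lakectl diff lakefs://repo/main lakefs://repo/experiment

# promote to production
lakectl merge lakefs://repo/experiment lakefs://repo/main

# oops, bad merge? revert it (a merge commit also needs --parent-number)
lakectl branch revert lakefs://repo/main <commit-id> --parent-number 1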
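
An orchestrator task can wrap the same commands. The sketch below only builds the `lakectl` argument lists and leaves execution to the caller (for example `subprocess.run(cmd, check=True)`); the function name and the `validation_passed` flag are hypothetical.

```python
def promotion_plan(repo: str, branch: str, message: str,
                   validation_passed: bool) -> list[list[str]]:
    """Return lakectl commands that commit a branch and, if validation passed, merge it into main."""
    source = f"lakefs://{repo}/{branch}"
    destination = f"lakefs://{repo}/main"
    plan = [["lakectl", "commit", source, "-m", message]]
    if validation_passed:
        # Only a validated branch is promoted; otherwise main is left alone.
        plan.append(["lakectl", "merge", source, destination])
    return plan


for cmd in promotion_plan("repo", "experiment", "rebuild features", validation_passed=True):
    print(" ".join(cmd))
```

Keeping the merge behind an explicit flag means a failed check leaves the work committed on the branch for inspection while `main` never sees it.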

Key capabilities

  • Zero-copy branching: branches are metadata pointers, so even petabyte-scale tables branch instantly.
  • Atomic commits and merges: no partial states visible to readers.
  • Format-agnostic: works with Parquet, Iceberg, Delta, CSV, images, model weights, or any other object.
  • Pre-commit and pre-merge hooks: run validation (schema checks, data quality tests) before data lands on main.
  • Garbage collection: safely clean up unreferenced objects from deleted branches.
  • CI/CD for data: integrate with Airflow, GitHub Actions, or any orchestrator to test-then-promote data changes.
  • S3-compatible gateway over S3, GCS, or Azure Blob: clients change only the endpoint and path prefix, so any tool that speaks S3 keeps working.
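
The validation a hook runs can be as small as a function over the changed paths. Below is a minimal sketch of such a check with two hypothetical rules (an allowed prefix and Parquet-only files); how the hook obtains the list of changed paths is not shown, and the function name is invented for illustration.

```python
def check_changed_paths(paths: list[str], allowed_prefix: str = "features/") -> list[str]:
    """Return human-readable violations; an empty list means the change may be promoted."""
    violations = []
    for path in paths:
        if not path.startswith(allowed_prefix):
            violations.append(f"{path}: outside {allowed_prefix}")
        elif not path.endswith(".parquet"):
            violations.append(f"{path}: not a Parquet file")
    return violations


changed = ["features/part-0.parquet", "features/notes.txt", "raw/dump.csv"]
for problem in check_changed_paths(changed):
    print(problem)
```

A hook built around this would fail the commit or merge whenever the returned list is non-empty.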

Why it matters for AI

ML reproducibility requires pinning both code and data to a specific version. lakeFS commits give you immutable data snapshots that you can tag with an experiment ID. You can branch the training dataset, add augmented samples, validate, and merge — all without risking the production dataset. Model training becomes reproducible, auditable, and reversible.
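
One way to pin code and data together is a small record stored with each run. This is a sketch under assumptions: the field names and sample IDs are placeholders, and it assumes the ref segment of a path accepts a commit ID the same way it accepts a branch name.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentPin:
    experiment_id: str
    code_sha: str      # Git commit of the training code
    data_repo: str     # lakeFS repository name
    data_commit: str   # lakeFS commit ID of the training data

    def data_uri(self, path: str) -> str:
        """Address data by commit ID rather than branch name, so later reads see the same snapshot."""
        return f"s3a://{self.data_repo}/{self.data_commit}/{path.lstrip('/')}"


# Placeholder identifiers, for illustration only.
pin = ExperimentPin("exp-042", "3f2a9c1", "repo", "a1b2c3d4")
print(json.dumps(asdict(pin), sort_keys=True))
print(pin.data_uri("features/"))
```

Reading through a branch name would follow the branch as it moves; reading through the recorded commit ID does not, which is what makes a rerun comparable to the original.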

When to use, when to skip

Use it when you need Git-like workflows over data — CI/CD for pipelines, isolated development environments, reproducible ML datasets, or safe rollbacks. It complements Iceberg/Delta (which handle table-level transactions) by adding repository-level branching.

Skip it if time-travel in your table format (Iceberg snapshots, Delta versions) gives you enough versioning, or if your data is small enough that copying directories is practical.

Heads up: lakeFS is a versioning layer, not a table format. It doesn't give you schema evolution, partition pruning, or column stats; that's still your table format's job. Think of it as "Git for the whole lake" and Iceberg/Delta as "ACID for individual tables."

vs the alternatives

Tool           | Best for                              | Trade-off
lakeFS         | Git workflows over any data           | Another service to run; complements, not replaces, table formats
Apache Iceberg | Table-level time travel, multi-engine | Table-scoped, not repo-scoped versioning
Delta Lake     | Table-level ACID, Spark ecosystem     | Table-scoped versioning
Nessie         | Git-like catalog for Iceberg tables   | Iceberg-only; not format-agnostic

Verified against the lakeFS docs (docs.lakefs.io), May 2026. Targets v1.x.

© cvam — written in plaintext, served warm