TL;DR: JuiceFS is a POSIX-compatible distributed filesystem that stores data in an object store (S3, GCS, MinIO) and metadata in a transactional database (Redis, PostgreSQL, TiKV). Multi-level caching lets AI training jobs read hot data at close to local-NVMe speed, while keeping the economics and capacity of cloud object storage. CNCF Sandbox project with a Kubernetes CSI driver.
What it is
JuiceFS is an open-source, cloud-native distributed filesystem designed for large-scale data workloads. It separates metadata (stored in engines like Redis, PostgreSQL, or TiKV) from data (stored in object storage like S3 or MinIO). It's a CNCF Sandbox project. In the AI Native landscape it lives in AI Native Infra › Storage.
Why it exists
AI training can read billions of small files (images, tokens, shards) with random access patterns. Object storage is cheap, but its per-request latency makes it slow for this pattern. Local NVMe is fast but not shared between nodes. JuiceFS bridges the gap: data lives in object storage (cheap, effectively unlimited capacity), while a local cache on each node's SSD serves hot reads at close to NVMe speed. Because the interface is POSIX, frameworks like PyTorch read it like a local directory.
Fig 1 — Pods mount JuiceFS like local storage; the client caches hot data on SSD and stores cold data in object storage.
How it works
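The JuiceFS client runs as a FUSE mount, either started directly on the host or managed by the CSI driver on Kubernetes. On reads, it checks the local SSD cache first; a cache hit is served without a round trip to object storage. On a cache miss, it fetches the data from object storage and populates the cache. Metadata operations (ls, stat, open) go to the metadata engine, which is a fast transactional store. Writes are buffered on the client and uploaded to object storage when the file is flushed (for example on close or fsync); with the --writeback mount option they are staged in the local cache directory and uploaded asynchronously.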
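The read path described above can be sketched as a read-through cache. This is a toy model under stated assumptions, not JuiceFS's client code: `ReadThroughCache`, `fetch_from_object_store`, and the in-memory dict standing in for the SSD cache are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy model of a read-through block cache (not JuiceFS's actual client)."""

    def __init__(self, fetch_from_object_store: Callable[[str], bytes]) -> None:
        # Hypothetical callable standing in for a GET against the object store.
        self._fetch = fetch_from_object_store
        # A dict stands in for the local SSD cache directory.
        self._local: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, block_key: str) -> bytes:
        # 1. Check the local cache first.
        if block_key in self._local:
            self.hits += 1
            return self._local[block_key]
        # 2. On a miss, fetch from the backing store and populate the cache.
        self.misses += 1
        data = self._fetch(block_key)
        self._local[block_key] = data
        return data


# Usage: a plain dict plays the role of the object store.
object_store = {"chunk-0": b"hello", "chunk-1": b"world"}
cache = ReadThroughCache(object_store.__getitem__)
first = cache.read("chunk-0")   # miss: goes to the backing store
second = cache.read("chunk-0")  # hit: served from the local cache
```

The sketch leaves out eviction, block sizing, and metadata lookups, all of which a real client has to handle. It only shows the hit and miss branches.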
Key features
- POSIX compatible — works with PyTorch DataLoader, HuggingFace datasets, any tool expecting a filesystem.
- Multi-level caching — kernel page cache → local SSD → object storage. Hot data stays on NVMe.
- Elastic capacity — data lives in object storage, so capacity scales to petabytes without provisioning.
- Kubernetes CSI driver — mount JuiceFS as a PV in pods, with cache lifecycle managed per node.
- Multiple metadata engines — Redis (fastest), PostgreSQL, MySQL, TiKV (scalable), SQLite (single-node).
- Strong consistency — close-to-open consistency by default, immediate consistency optional.
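To illustrate the POSIX point in the list above: data-loading code can use ordinary file calls and needs no storage SDK. The following is a minimal sketch. `FileDataset` and the `.bin` suffix are hypothetical, and a temporary directory stands in for a mount point such as /mnt/jfs so that the example runs anywhere. The class defines `__len__` and `__getitem__`, the two methods PyTorch documents for map-style datasets, so an object like this should be usable with a DataLoader.

```python
import os
import tempfile
from pathlib import Path
from typing import List


class FileDataset:
    """Map-style dataset over a directory tree, using only ordinary file calls."""

    def __init__(self, root: str, suffix: str = ".bin") -> None:
        # Sort so that every worker sees the same index-to-file mapping.
        self.paths: List[Path] = sorted(
            p for p in Path(root).rglob(f"*{suffix}") if p.is_file()
        )

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int) -> bytes:
        # A plain read. On a JuiceFS mount, the client decides whether the
        # bytes come from a cache or from object storage.
        return self.paths[index].read_bytes()


# Stand-in for a mount point (assumption: real data would live under /mnt/jfs).
demo_root = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(demo_root, f"sample-{i}.bin"), "wb") as f:
        f.write(bytes([i]) * 4)

dataset = FileDataset(demo_root)
```

Pointing `root` at the real mount path is the only change needed to read from JuiceFS instead of the temporary directory.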
Quick start
Format a filesystem, then mount it:
# format: metadata in Redis, data in S3
juicefs format \
    --storage s3 \
    --bucket https://my-bucket.s3.amazonaws.com \
    redis://localhost:6379/1 \
    mydata

# mount
juicefs mount redis://localhost:6379/1 /mnt/jfs --cache-dir /ssd/jfs-cache
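If the mount is missing, a job that reads /mnt/jfs would see an ordinary, probably empty, local directory. A small guard at job start can catch that. The sketch below is one possible check, not part of JuiceFS: `require_mount` is a hypothetical helper, and `os.path.ismount` only tells you that something is mounted at the path, not that it is JuiceFS.

```python
import os
import tempfile


def require_mount(path: str) -> str:
    """Return path if it is a mount point, otherwise raise."""
    if not os.path.ismount(path):
        raise RuntimeError(f"{path} is not a mount point; is JuiceFS mounted?")
    return path


# Demonstration without JuiceFS: "/" is a mount point, a fresh temp dir is not.
root_ok = require_mount("/")
try:
    require_mount(tempfile.mkdtemp())
    plain_dir_rejected = False
except RuntimeError:
    plain_dir_rejected = True
```

In a training script the call would be `require_mount("/mnt/jfs")`, placed before any dataset is opened.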
On Kubernetes, install the CSI driver via Helm, then create a PV/PVC referencing the JuiceFS volume — pods mount it transparently.
When to use, when to skip
Use it when training jobs need fast, shared access to large datasets stored in object storage, especially multi-node distributed training where every GPU worker needs the same data. Once data is cached, reads no longer pay the object-storage latency tax.
Skip it for pure streaming workloads (video, large sequential reads) where object storage throughput is already sufficient, or when your dataset fits on a single node's local disk. Also overkill for serving/inference where data access is minimal.
vs / alongside
| Tool | Approach | Note |
|---|---|---|
| JuiceFS | POSIX FS on object storage, SSD caching | Best for random-read AI workloads |
| Alluxio | Data orchestration / caching layer | Java-based, broader Hadoop ecosystem |
| MinIO | S3-compatible object storage | The data store, not a filesystem |
| CubeFS | Distributed FS with S3 and POSIX | CNCF, more ops overhead |
References
- JuiceFS documentation — concepts and guides.
- juicedata/juicefs — source (CNCF Sandbox).
- JuiceFS CSI Driver — Kubernetes integration.
Extra reads
- JuiceFS for AI training — best practices for ML workloads.
- JuiceFS blog — case studies and benchmarks.
Verified against JuiceFS docs (juicefs.com), May 2026.