// AI NATIVE STACK

AI Native › Data › Data Science › Ray Data

CRASH COURSE · AI-NATIVE · intermediate · 11 min read · v2.x

Ray Data — distributed datasets built for ML pipelines.

data-science ai-native ray distributed ml-pipeline

TL;DR — Ray Data is the data loading and preprocessing library in the Ray ecosystem. It provides a Dataset abstraction — a distributed collection of Arrow-backed blocks — designed to feed data into distributed training and batch inference. It handles the "last mile" between storage and GPU: reading, transforming, shuffling, and streaming data across a cluster. In the AI Native landscape it lives in Data › Data Science.

What it is

Ray Data (formerly Ray Datasets) is one of Ray's built-in libraries, alongside Ray Train, Ray Tune, and Ray Serve. It gives you a lazy, distributed dataset API that reads from Parquet, CSV, JSON, images, Delta Lake, Iceberg, or custom data sources — and applies map, filter, groupby, and repartition operations across a Ray cluster.

Unlike Spark DataFrames (which are optimized for SQL analytics), Ray Data is optimized for ML workloads: streaming batches to GPU trainers, last-mile preprocessing (tokenization, image augmentation), and connecting heterogeneous data to Ray Train.

Why it exists

ML training has a data feeding problem. Your training data sits in Parquet on S3, but your model runs on 8 GPUs that each need a continuous stream of preprocessed batches. The traditional approach is "preprocess everything, save to disk, then load" — which doubles storage and goes stale. Ray Data exists to:

  • Stream data — read from storage and preprocess on-the-fly, no intermediate materialization.
  • Scale preprocessing — distribute tokenization, augmentation, or feature computation across CPU workers.
  • Bridge to training — hand batches directly to Ray Train, PyTorch DataLoader, or any trainer without serialization overhead.
  • Unify the pipeline — one framework for ingest → preprocess → train → batch-predict instead of stitching Spark + pandas + PyTorch.

How it works

S3 / Parquet data source Ray Data read → map → shuffle streaming execution Arrow blocks on workers Ray Train GPU trainers Ray Serve batch inference

Fig 1 — Ray Data reads and preprocesses data on CPU workers, then streams batches to GPU trainers or batch inference.

A Ray Dataset is a collection of blocks (Arrow tables or numpy arrays) distributed across Ray workers. Operations are lazy — calling .map() or .filter() builds an execution plan. Data flows when you call .iter_batches(), .to_torch(), or .materialize(). The execution is streaming by default: blocks are processed and passed downstream without waiting for the entire dataset to be read.

Key capabilities

  • Streaming execution — process data in a pipeline without materializing the full dataset in memory.
  • Distributed map — apply UDFs (tokenize, augment, featurize) across all cluster CPUs in parallel.
  • GPU preprocessing — run map operations on GPU workers for things like image resize or CLIP embedding.
  • Zero-copy to trainers — Arrow blocks feed directly into PyTorch/TensorFlow DataLoaders.
  • Rich I/O — Parquet, CSV, JSON, images, binary files, Delta Lake, Iceberg, BigQuery, MongoDB.
  • Repartition & shuffle — random shuffle for training, repartition for balanced parallelism.
  • Batch inference — use .map_batches(model_fn, compute=ActorPoolStrategy(...)) to run inference at scale.

Quick start

import ray

# read
ds = ray.data.read_parquet("s3://bucket/training-data/")

# preprocess
ds = ds.map(lambda row: {"tokens": tokenize(row["text"]), "label": row["label"]})
ds = ds.filter(lambda row: len(row["tokens"]) > 10)
ds = ds.random_shuffle()

# feed to PyTorch trainer
for batch in ds.iter_torch_batches(batch_size=64):
    logits = model(batch["tokens"])
    loss = criterion(logits, batch["label"])

# or batch inference
ds = ds.map_batches(MyModel, compute=ray.data.ActorPoolStrategy(size=4), concurrency=4)
ds.write_parquet("s3://bucket/predictions/")

Why it matters for AI

Ray Data is purpose-built for ML data loading. It solves the "data feeding" bottleneck in distributed training — where GPU utilization drops because CPUs can't preprocess fast enough. By distributing preprocessing across the cluster and streaming to GPUs, it keeps expensive accelerators fed. It's the native data layer for Ray Train (distributed training) and Ray Serve (model serving with batch transforms).

When to use, when to skip

Use it when you're already on Ray (Ray Train, Ray Serve, KubeRay), need distributed data preprocessing for training, or want to run batch inference at scale. It's the glue between storage and GPU.

Skip it for pure analytics or EDA — use Polars or Spark. Also skip if your data fits on one machine and a PyTorch DataLoader handles your preprocessing fine. Ray Data shines when single-node data loading is the bottleneck.

heads up Ray Data requires a Ray cluster (even ray.init() for local mode). If you're not using Ray for training or serving, the operational overhead of adding Ray just for data loading probably isn't worth it.

vs the alternatives

ToolBest forTrade-off
Ray DataDistributed ML data loading, GPU feedingNeeds Ray cluster
PyTorch DataLoaderSingle-node training data loadingNo distributed preprocessing
PolarsFast single-machine analyticsSingle-node only
Spark DataFrameDistributed analytics at scaleJVM, not ML-optimized
Mosaic StreamingDatasetStreaming from cloud for trainingTraining-only, narrower scope

References

Extra reads

Verified against the Ray docs (docs.ray.io), May 2026. Covers Ray v2.x.

← AI Native Stack
© cvam — written in plaintext, served warm