// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · beginner · 10 min read · v2.2

Pandas — the DataFrame library every ML engineer learns first.

TL;DR — Pandas is the standard Python library for tabular data manipulation. It gives you DataFrame and Series objects for loading, cleaning, transforming, aggregating, and exploring datasets — all in memory, on a single machine. It's the first tool in every data scientist's toolkit and the default for datasets that fit in RAM. In the AI Native landscape it lives in Data › Data Science.

What it is

Pandas provides two core data structures: Series (1D labeled array) and DataFrame (2D labeled table — think spreadsheet or SQL table in Python). It wraps NumPy arrays with labeled axes, rich indexing, and hundreds of built-in operations for data wrangling.
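
As a rough illustration of the two structures, here is a minimal sketch. The model names, column names, and values are made up for the example.

```python
import pandas as pd

# A Series is a 1D array of values plus an index of labels.
scores = pd.Series(
    [0.91, 0.78, 0.85],
    index=["model_a", "model_b", "model_c"],
    name="accuracy",
)

# A DataFrame is a 2D table: each column is a Series sharing one row index.
runs = pd.DataFrame(
    {
        "accuracy": [0.91, 0.78, 0.85],
        "epochs": [10, 5, 8],
    },
    index=["model_a", "model_b", "model_c"],
)

# Label-based lookup works on both structures.
best = scores.idxmax()                    # label of the highest accuracy
epochs_of_best = runs.loc[best, "epochs"]

# Selecting one column of a DataFrame gives back a Series.
accuracy_column = runs["accuracy"]
```

The labels are what separate these objects from plain NumPy arrays: rows and columns are addressed by name rather than by position.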

Created by Wes McKinney in 2008 at AQR Capital, it's now the most-used data library in the Python ecosystem — the gateway drug to ML for millions of practitioners.

Why it exists

Before pandas, Python data work meant nested loops over lists, manual CSV parsing, or jumping to R. NumPy handles numeric arrays well but has no concept of column names, mixed types, missing data handling, or group-by. Pandas fills the gap between raw NumPy and a full database, giving you SQL-like operations in a Python-native API.

Key capabilities

  • I/O — read/write CSV, Parquet, JSON, Excel, SQL, HDF5, Feather, and more with one function call.
  • Selection & filtering — label-based (.loc), position-based (.iloc), boolean masks, query strings.
  • Transformation — apply, map, vectorized string/datetime ops, pivot, melt, stack/unstack.
  • Aggregation — groupby, agg, transform, window functions, rolling/expanding.
  • Missing data — NaN/NA handling with fillna, dropna, interpolation.
  • Merging — SQL-style joins (merge), concatenation (concat), combine_first.
  • Time series — date ranges, resampling, shifting, rolling windows, timezone handling.
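  • Copy-on-Write (2.x, opt-in) — set pd.options.mode.copy_on_write = True to get CoW semantics and eliminate the SettingWithCopyWarning footgun; CoW becomes the default in pandas 3.0.
  • Arrow backend — dtype_backend="pyarrow" (a keyword on readers such as read_csv and on convert_dtypes) for faster I/O, lower memory, and nullable types.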
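
Several of the capabilities above can be sketched on toy data. This is an illustrative tour and not a complete reference; the tables, column names, and values are all invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data; every column name and value here is made up for illustration.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5],
        "customer": ["ann", "bob", "ann", "cat", "bob"],
        "amount": [20.0, np.nan, 35.0, 50.0, 15.0],
    }
)
customers = pd.DataFrame({"customer": ["ann", "bob"], "region": ["EU", "US"]})

# Selection & filtering: the same rows via a boolean mask and a query string.
big_mask = orders.loc[orders["amount"] > 18, ["order_id", "amount"]]
big_query = orders.query("amount > 18")[["order_id", "amount"]]

# Position-based selection: first two rows, first two columns.
top_left = orders.iloc[:2, :2]

# Missing data: fill the missing amount with the column median.
filled = orders.assign(amount=orders["amount"].fillna(orders["amount"].median()))

# Aggregation: one row per customer, using named aggregations.
per_customer = filled.groupby("customer").agg(
    total=("amount", "sum"),
    n_orders=("order_id", "count"),
)

# transform keeps the original row count and broadcasts the group result back.
filled["customer_total"] = filled.groupby("customer")["amount"].transform("sum")

# Merging: a left join keeps every order; customers without a match get NaN.
joined = filled.merge(customers, on="customer", how="left")

# Time series: resample 14 daily readings into weekly means.
daily = pd.Series(
    np.arange(14, dtype="float64"),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)
weekly = daily.resample("W").mean()
```

Note that the comparison amount > 18 is False for the missing value, so the row with NaN is dropped by both the mask and the query.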

Quick start

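import numpy as np
import pandas as pd

# load: use the reader that matches the file format
df = pd.read_csv("training_data.csv")
# df = pd.read_parquet("features.parquet")  # same idea for Parquet
labels = pd.read_csv("labels.csv")  # illustrative file name; any table with an "id" column

# explore
df.head()       # first five rows
df.describe()   # summary statistics for numeric columns
df.info()       # dtypes, non-null counts, memory usage

# clean
df = df.dropna(subset=["label"])  # drop rows that have no label
df["feature"] = df["feature"].astype("float32")

# transform
df["log_price"] = np.log1p(df["price"])  # vectorized, no per-row lambda
grouped = df.groupby("category")["score"].mean()

# merge
merged = pd.merge(df, labels, on="id", how="left")

# save
merged.to_parquet("clean_features.parquet", index=False)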

Why it matters for AI

Pandas is where feature engineering happens: loading raw data, cleaning nulls, encoding categoricals, computing aggregates, joining tables, and producing the matrix that goes into model.fit(). Most ML tutorials, Kaggle notebooks, and EDA notebooks start with import pandas as pd. scikit-learn and XGBoost accept pandas DataFrames as input; PyTorch does not take them directly, so convert to NumPy arrays or tensors first (for example with df.to_numpy()).
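
The path from a raw table to a model-ready matrix might look like the sketch below. The table, its column names, and the mean-imputation choice are assumptions for illustration, not a recommended recipe.

```python
import pandas as pd

# Hypothetical raw training table; the column names are assumptions.
raw = pd.DataFrame(
    {
        "price": [10.0, None, 30.0, 40.0],
        "category": ["a", "b", "a", "c"],
        "label": [0, 1, 0, 1],
    }
)

# Clean nulls: fill the missing price with the column mean.
clean = raw.assign(price=raw["price"].fillna(raw["price"].mean()))

# Encode categoricals: one indicator column per category value.
encoded = pd.get_dummies(clean, columns=["category"], dtype=float)

# Split into the feature matrix X and the target vector y.
X = encoded.drop(columns=["label"])
y = encoded["label"]

# Libraries that do not take DataFrames directly can consume NumPy arrays.
X_array = X.to_numpy(dtype="float32")
y_array = y.to_numpy()
```

X and y can go to an estimator that accepts DataFrames, while X_array and y_array suit libraries that expect plain arrays.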

When to use, when to skip

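Use it for datasets that fit comfortably in memory (roughly up to ~10 GB, depending on your machine; a common rule of thumb is to have 5 to 10 times as much RAM as the size of the dataset), interactive exploration, feature engineering, and anywhere the pandas API's breadth saves you time.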
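
One way to check whether a table fits comfortably is to measure it and shrink its dtypes. The sketch below uses a synthetic table with arbitrary sizes; whether float32 or a categorical column is acceptable depends on your data.

```python
import numpy as np
import pandas as pd

# Synthetic table standing in for a real dataset (sizes are arbitrary).
n = 100_000
df = pd.DataFrame(
    {
        "user_id": np.arange(n, dtype="int64"),
        "score": np.linspace(0.0, 1.0, n),                     # float64 by default
        "country": np.tile(["DE", "US", "IN", "BR"], n // 4),  # repeated strings
    }
)

# deep=True counts the actual string payloads, not just the references.
before = df.memory_usage(deep=True).sum()

# Shrink dtypes where the reduced precision or range is acceptable.
slim = df.assign(
    user_id=pd.to_numeric(df["user_id"], downcast="unsigned"),
    score=df["score"].astype("float32"),
    country=df["country"].astype("category"),
)
after = slim.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

Low-cardinality string columns usually benefit most from the category dtype, since each distinct value is stored once.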

Skip it when data exceeds memory — switch to Polars for faster single-machine processing, or Ray Data / Spark for distributed workloads. Also skip for performance-critical production pipelines where pandas' single-threaded eager execution is the bottleneck.

Heads up: pandas is single-threaded and eager by default, so every operation materializes its result immediately. For large datasets, chain operations carefully and prefer vectorized ops over .apply() with Python lambdas. Consider the PyArrow backend, which can cut memory use noticeably, especially for string columns: pass dtype_backend="pyarrow" to a reader such as pd.read_csv, or call df.convert_dtypes(dtype_backend="pyarrow").
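
To make the vectorization advice concrete, here is a small sketch comparing a per-element lambda with a single NumPy call. The price column is synthetic, and no timings are claimed since they vary by machine.

```python
import numpy as np
import pandas as pd

# Synthetic column of prices (values are arbitrary).
prices = pd.Series(np.linspace(1.0, 1000.0, 50_000), name="price")

# Row-by-row: calls the Python lambda once per element.
slow = prices.apply(lambda x: np.log1p(x))

# Vectorized: one NumPy call over the whole column.
fast = np.log1p(prices)

# Both paths give the same values; only the execution path differs.
same = np.allclose(slow, fast)
```

To measure the difference on your own data, wrap each line with %timeit in a notebook.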

vs the alternatives

Tool             Best for                            Trade-off
Pandas           Universal, huge API, ecosystem      Single-threaded, memory-hungry
Polars           Speed, larger-than-RAM, lazy eval   Smaller ecosystem, different API
Ray Data         Distributed datasets, ML pipelines  Cluster overhead
Spark DataFrame  Petabyte-scale distributed          JVM, cluster, latency

References

Verified against the pandas docs (pandas.pydata.org), May 2026. Covers v2.2+.

© cvam — written in plaintext, served warm