// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · beginner · 9 min read · local

Ollama — run an LLM on your own machine in one command.

TL;DR: Ollama is one of the easiest ways to run LLMs locally. It wraps llama.cpp behind a single CLI and background server, downloads pre-quantized GGUF models, handles GPU offload automatically, and exposes an OpenAI-compatible API on :11434. Type ollama run llama3.1 and you're chatting. A Modelfile customizes models. Built for dev, privacy, and offline use, not high-concurrency production.

What it is

Ollama is a popular CLI-and-server tool for running open LLMs on your own hardware. It wraps the llama.cpp inference engine with one-command model management, pulls quantized GGUF weights, picks GPU/CPU offload automatically, and serves an OpenAI-compatible REST API alongside its native one. In the AI Native landscape it sits in AI Native Infra › Workload Runtime, at the local/edge end of the spectrum.

Why it exists

Running a model locally used to mean compiling llama.cpp, hunting for the right GGUF quant, and wiring your own server. Ollama makes it a package-manager experience: pull, run, done. The payoff is privacy (data never leaves the machine), zero per-token cost, and offline capability — ideal for development, prototyping, and sensitive data.

How it works

Installing Ollama starts a background server on localhost:11434. The CLI talks to it; so can your code. Pull a model and it downloads the GGUF and prepares it; run it and Ollama loads it (offloading layers to GPU where it can) and starts serving. The model stays loaded for follow-up requests (by default for about five minutes after the last one), so only the first request pays the load time.
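To see the server side of this for yourself, the sketch below checks whether anything is answering on localhost:11434 and, if so, asks the native API which models are on disk (/api/tags) and which are currently loaded in memory (/api/ps). OLLAMA_URL, api_url, and server_up are names made up for this example, and it assumes curl is installed.

```shell
# Base URL of the local server. OLLAMA_URL is a variable name chosen for
# this sketch, not something Ollama itself reads.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Build a full URL for a native API path, e.g. api_url /api/tags
api_url() {
  printf '%s%s\n' "${OLLAMA_URL%/}" "$1"
}

# Succeeds only if something answers on the server's root path.
server_up() {
  curl -fsS --max-time 2 "$(api_url /)" >/dev/null 2>&1
}

if server_up; then
  curl -fsS "$(api_url /api/tags)"; echo   # models already pulled to disk
  curl -fsS "$(api_url /api/ps)"; echo     # models currently loaded
else
  echo "No Ollama server reachable at ${OLLAMA_URL}" >&2
fi
```

Run it once before and once after ollama run llama3.1: the model should show up in the /api/ps output only while it is loaded.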

[Figure: CLI / app and OpenAI SDK → Ollama server (:11434, GGUF, GPU offload) → llama.cpp engine (local weights)]

Fig 1 — One local server fronts llama.cpp and speaks the OpenAI API.

Modelfile — customizing a model

A Modelfile declares a model variant: its base, a system prompt, parameters (temperature, context length), even a LoRA adapter — analogous to a Dockerfile for models. ollama create builds your named variant from it, which you then run like any other model.
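As a rough illustration, the snippet below writes a small Modelfile and builds a variant from it. The variant name terse-helper, the system prompt, and the parameter values are all invented for this example; the build step is skipped if the ollama CLI is not installed.

```shell
# Write a Modelfile for a hypothetical variant called "terse-helper".
cat > Modelfile <<'EOF'
# Base model to start from
FROM llama3.1
# Sampling temperature and context length (example values)
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
# A LoRA adapter would be declared like this (hypothetical path):
# ADAPTER ./my-lora.gguf
SYSTEM """You are a terse assistant. Answer in at most three sentences."""
EOF

# Build the named variant only when the CLI is available.
if command -v ollama >/dev/null 2>&1; then
  ollama create terse-helper -f Modelfile
fi
```

After a successful build, ollama run terse-helper starts a chat with the customized variant, the same way as with any pulled model.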

Quick start

curl -fsSL https://ollama.com/install.sh | sh     # macOS/Windows: use the app
ollama run llama3.1                                # pulls + chats

# use it like OpenAI from code:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1","messages":[{"role":"user","content":"hi"}]}'

Point any OpenAI client at base_url=http://localhost:11434/v1 with a placeholder API key (the client requires one; Ollama ignores it). The native API lives at /api/chat.
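For comparison, here is a minimal sketch of the same request against the native endpoint. Setting "stream": false asks for a single JSON reply instead of a stream of chunks. The helper names chat_payload and ask_native are invented for this example, and the payload builder assumes the message contains no quotes or backslashes.

```shell
# Build the JSON body for the native chat endpoint.
# $1 = model name, $2 = user message (assumed free of quotes and backslashes)
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"stream":false}' "$1" "$2"
}

# Send the request to a locally running server.
ask_native() {
  curl -fsS http://localhost:11434/api/chat \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}

# Print the body that would be sent; no server is needed for this line.
chat_payload llama3.1 "hi"; echo
```

With the server running, ask_native llama3.1 "hi" returns one JSON object whose reply text sits under message.content.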

When to use, when to skip

Use it for local development, private/offline inference, edge devices, and quick experiments across the 100+ model library — anywhere ease and data locality beat raw throughput. It's the friendliest on-ramp to open models.

Skip it for high-concurrency production serving — that's vLLM's job (PagedAttention + continuous batching scale far better). On Kubernetes you'd serve via KServe/vLLM, not Ollama. Use Ollama to prototype, then graduate the workload.

Heads up: Ollama optimizes for single-user simplicity, not throughput, so it won't match vLLM under concurrent load. Model quality also depends on the quant you pull: a 4-bit model is smaller and faster but lower fidelity than 8-bit or full precision. Match the quant to your RAM/VRAM.
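A back-of-the-envelope way to match a quant to your memory: weight size is roughly parameter count times bits per weight, divided by 8 to get bytes. The sketch below applies that to an 8-billion-parameter model. It is only an estimate, since it ignores the KV cache and runtime overhead, and real GGUF quant formats store slightly more than their nominal bits per weight. The function name weights_mb is made up for this example.

```shell
# Rough weight size in MB.
# $1 = parameters in billions, $2 = bits per weight
# billions of params x bits / 8 gives GB; x 1000 converts to MB.
weights_mb() {
  echo $(( $1 * $2 * 1000 / 8 ))
}

for bits in 4 8 16; do
  echo "8B model at ${bits}-bit: ~$(weights_mb 8 "$bits") MB of weights"
done
```

That works out to roughly 4 GB, 8 GB, and 16 GB of weights, which is why the 4-bit build is usually the one that fits on a laptop.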

vs the alternatives

Tool | Best for | Trade-off
Ollama | Easy local/dev/offline LLMs | Not high-concurrency
llama.cpp | Max control, embedded, the engine itself | More manual setup
LM Studio | GUI local model running | Desktop-app oriented
vLLM | Production throughput serving | Heavier, GPU-focused

References

Verified against the official Ollama docs (docs.ollama.com), May 2026.

© cvam — written in plaintext, served warm