Spent the morning unlearning what I thought I knew about logits. Notes from chapter 4 of the deep learning book + a sketch I drew on the back of a napkin while waiting for k8s pods to schedule. Turns out temperature is one of those things everyone uses but few people actually sit with long enough to build intuition.
The short version
Temperature (τ) is a knob for how much you trust your model's raw outputs. High temperature = "I'm not sure, spread the probability around." Low temperature = "I'm very confident, sharpen the distribution."
That's it. That's the whole mental model. Everything else is just the math confirming this.
What softmax actually does
Before we touch temperature, let's make sure softmax is solid. Given a vector of
logits z — the raw, unnormalized outputs of a neural network — softmax
converts them to a proper probability distribution:
softmax(z_i) = exp(z_i) / Σ exp(z_j)
Every output is between 0 and 1. They all sum to 1. The exponential does the heavy lifting: it makes big logits much bigger relative to small ones, creating a natural "winner takes more" dynamic.
The subtle thing people miss: softmax doesn't just normalize. It amplifies differences. A logit of 5.0 vs 4.0 — only 1.0 apart — becomes a probability ratio of roughly 2.7:1 after exponentiation. Softmax is opinionated by default.
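To make that concrete, here's a minimal NumPy sketch (the helper name is mine, not from any library) that reproduces the 5.0-vs-4.0 example:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating; it cancels out in the
    # normalization and keeps exp() from overflowing on large logits.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax([5.0, 4.0])
print(p)            # ~[0.731 0.269]
print(p[0] / p[1])  # ~2.718: a 1.0 logit gap becomes an e:1 probability ratio
```

The max-subtraction trick is standard practice: it changes nothing mathematically, it just avoids overflow.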
Enter temperature
When you introduce temperature, you divide each logit by τ before applying softmax:
softmax(z_i / τ) = exp(z_i / τ) / Σ exp(z_j / τ)
This single division changes everything about the output distribution. Let's walk through the extremes.
τ → 0 (freezing cold)
Dividing by a tiny number makes all logits huge, but the largest one dominates exponentially. The distribution collapses to a one-hot vector — pure argmax. The model becomes maximally confident, always picking its top choice with probability ≈ 1.
τ = 1 (room temperature)
No change. You get the distribution the model was trained to produce. This is the default behavior.
τ → ∞ (boiling hot)
Dividing by a massive number squashes all logits toward zero. exp(0) = 1
for everything, so the distribution approaches uniform. The model says "I genuinely
have no idea, all options are equally valid."
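Here's a quick numerical sketch of the three regimes, using toy logits I made up:

```python
import numpy as np

def softmax_t(z, tau):
    # Temperature-scaled softmax: divide the logits by tau, then normalize.
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [3.0, 1.0, 0.2]
for tau in (0.01, 1.0, 100.0):
    print(tau, softmax_t(logits, tau).round(3))
# 0.01  -> [1.    0.    0.   ]  collapses to one-hot (argmax)
# 1.0   -> [0.836 0.113 0.051]  the distribution the model was trained for
# 100.0 -> [0.339 0.332 0.329]  approaches uniform
```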
The key insight: temperature doesn't change what the model learned. It changes how decisively the model acts on what it learned.
Where the name comes from
The "temperature" metaphor is borrowed from statistical mechanics, specifically the Boltzmann distribution:
P(state_i) = exp(-E_i / kT) / Σ exp(-E_j / kT)
Same structure. In physics, high temperature means high entropy — particles explore more states randomly. Low temperature means particles settle into low-energy states predictably. The analogy is perfect:
- Low T → system settles into the most likely state (greedy, deterministic)
- High T → system explores many states (random, creative)
This isn't a loose metaphor. It's the exact same equation with different variable names. The logits are negative energies. The temperature scales exploration. That's it.
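If you want to convince yourself, the substitution z_i = -E_i and τ = kT makes the two formulas numerically identical. A sketch with made-up values:

```python
import numpy as np

z, tau = np.array([2.0, 1.0, 0.5]), 1.7    # made-up logits and temperature

softmax_p = np.exp(z / tau) / np.exp(z / tau).sum()

E, kT = -z, tau                             # logits as negative energies
boltzmann_p = np.exp(-E / kT) / np.exp(-E / kT).sum()

print(np.allclose(softmax_p, boltzmann_p))  # True: same equation, renamed
```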
The napkin sketch
I drew three softmax distributions at τ=0.5, τ=1.0, and τ=2.0 over a 4-class problem
with logits [2.0, 1.0, 0.5, 0.1].
At τ=0.5, the top class gets ~83% probability. It's a spike; the model is very sure. At τ=1.0, the top class drops to ~57%. Still the clear winner, but the alternatives get real mass. At τ=2.0, probabilities are roughly 41%, 25%, 19%, 16%. Noticeably flatter. The model starts to shrug.
The visual difference is striking. Same logits, same model, same learned knowledge. Completely different behavior.
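For the record (and because the napkin is long gone), here's the arithmetic, reproducible in NumPy:

```python
import numpy as np

def softmax_t(z, tau):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]
for tau in (0.5, 1.0, 2.0):
    print(tau, (100 * softmax_t(logits, tau)).round(1))
# 0.5 -> [82.8 11.2  4.1  1.9]   spike
# 1.0 -> [57.5 21.1 12.8  8.6]   trained distribution
# 2.0 -> [40.6 24.6 19.2 15.7]   flattened
```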
Where you actually see this
Temperature shows up everywhere in modern ML, and knowing what it does makes a lot of API parameters suddenly make sense.
LLM text generation
When you set temperature=0.2 in the OpenAI API, you're telling the model
"be predictable, pick the most likely tokens." At temperature=1.5, you get
more creative, surprising (and sometimes unhinged) completions. It's the same model
either way — you're just turning the confidence dial.
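Under the hood that dial is just temperature-scaled softmax followed by sampling. A toy sketch (the vocabulary and logits are invented; a real decoder does this per step over tens of thousands of tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab  = ["the", "a", "this", "banana"]    # invented toy vocabulary
logits = np.array([3.0, 2.0, 1.0, -1.0])   # invented next-token logits

def sample_token(logits, tau):
    z = logits / tau
    p = np.exp(z - z.max())
    p /= p.sum()
    return vocab[rng.choice(len(vocab), p=p)]

print([sample_token(logits, 0.2) for _ in range(8)])  # almost always "the"
print([sample_token(logits, 1.5) for _ in range(8)])  # varied; even "banana" gets a ~4% shot
```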
Knowledge distillation
Hinton's distillation paper uses high temperature to "soften" a teacher model's outputs. Why? Because the relative probabilities of wrong answers carry useful information. At τ=1, the teacher might say "it's a 3" with 99% confidence. At τ=20, it might say "it's a 3 (40%) but also looks like an 8 (25%) and a 5 (15%)." That softer distribution is a richer teaching signal.
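A sketch of that softening step, with teacher logits I invented for a digit classifier (Hinton et al. also scale the distillation loss by τ², which I'm leaving out here):

```python
import numpy as np

def softmax_t(z, tau):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

# Invented teacher logits over digits 0-9 for an image of a "3".
teacher_logits = np.array([0., 1., 2., 9., 1., 5., 0., 1., 6., 2.])

print(softmax_t(teacher_logits, 1.0).round(3))
# tau=1: class "3" takes ~93% -- near one-hot, not much signal in the rest
print(softmax_t(teacher_logits, 20.0).round(3))
# tau=20: "3" still leads, but "8" and "5" now carry visible probability,
# and those relative rankings are exactly what the student learns from
```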
Reinforcement learning
The explore-exploit tradeoff. High temperature early in training = explore more actions. Anneal the temperature down over time = gradually commit to what works. Same principle, different domain.
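A sketch of that pattern as Boltzmann (softmax) action selection with a made-up annealing schedule; the Q-values would come from your learner, and there's no environment here, just the selection mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.0, 0.8, 0.2])   # invented action-value estimates

def pick_action(q, tau):
    # Boltzmann exploration: sample an action from softmax(Q / tau).
    z = q / tau
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(q), p=p)

tau = 5.0
for step in range(10_000):
    action = pick_action(q_values, tau)   # act (and, in real RL, update Q)
    tau = max(0.05, tau * 0.999)          # anneal: explore early, exploit late
```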
Common mistakes I've seen (and made)
- Confusing temperature with top-k/top-p. Temperature reshapes the entire distribution. Top-k truncates it. Top-p clips by cumulative probability. They're different knobs, and they compose (see the sketch after this list).
- Thinking τ=0 is safe. Greedy decoding (τ→0) picks the most likely token at each step, which isn't even guaranteed to be the most likely sequence, let alone the best one. Beam search or sampling at a moderate temperature often produces better results.
- Not realizing temperature is applied before sampling. The softmax distribution is computed with temperature, then you sample from that distribution. Temperature doesn't add randomness — it modulates how much randomness the existing distribution contains.
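On that first point, here's a sketch of how the knobs compose. The temperature-first, truncate-second ordering is the common convention I've seen, not something mandated anywhere:

```python
import numpy as np

def temperature_then_top_p(logits, tau, p_keep):
    # Step 1: temperature reshapes the whole distribution.
    z = np.asarray(logits, dtype=float) / tau
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    # Step 2: top-p truncates it. Keep the smallest set of tokens whose
    # cumulative probability reaches p_keep, then renormalize.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p_keep) + 1
    kept = order[:cutoff]
    out = np.zeros_like(probs)
    out[kept] = probs[kept] / probs[kept].sum()
    return out

print(temperature_then_top_p([2.0, 1.0, 0.5, 0.1], tau=2.0, p_keep=0.8).round(3))
# -> [0.481 0.292 0.227 0.   ]: the flattened tau=2 distribution, minus its tail
```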
The mental model that sticks
Forget the physics metaphor if it doesn't click. Here's the one that works for me:
Temperature is how many drinks your model has had. Sober (τ→0) = rigid, by-the-book, won't take risks. Tipsy (τ=1) = normal, balanced judgment. Hammered (τ→∞) = says anything, zero filter.
Crude? Yes. Accurate? Also yes.
Takeaway
Next time someone says "adjust the temperature," just read it as "adjust the confidence." It clicks faster that way. The model already knows what it knows. Temperature just decides how boldly it acts on that knowledge.
One parameter. No retraining. Complete control over the confidence-creativity tradeoff. Pretty elegant, honestly.