You open X. Before your thumb finishes its swipe, a feed appears — 50 posts, ranked, filtered, personalized. One of them is from someone you've never followed. You like it. The algorithm smiles.
Behind that one-second experience is a seven-stage pipeline written in Rust and Python, a Grok-based transformer that predicts 15 different things you might do with a post, and an in-memory store that serves your followees' tweets faster than your database can say "SELECT." xAI just open-sourced the whole thing.
Why does this matter?
Social media algorithms are usually black boxes. You see posts, you don't know why. Regulators have been screaming for transparency. Elon Musk promised to open-source the algorithm. And now, for the first time, we can read every line of the system that decides what 500 million people see every day.
The interesting part isn't the politics — it's the engineering. This system replaces every single hand-crafted feature with a transformer that learns relevance end-to-end from your engagement history. No manual rules about "posts with images rank higher" or "replies boost visibility." The model figures all of that out itself.
That's a bold bet. Let's see how they pulled it off.
The big picture: seven stages, one feed
Every time you pull down to refresh, a request hits Home Mixer — the orchestration service written in Rust. It runs your request through seven stages, like an assembly line where each station does exactly one job:
- Query Hydration — Who are you? What have you liked, replied to, reposted recently? Your engagement history gets fetched.
- Candidate Sourcing — Two systems run in parallel to find posts you might want to see (more on this below).
- Candidate Hydration — Each candidate post gets enriched with metadata: text, media, author info, video duration, subscription labels.
- Pre-Scoring Filters — Junk removal. Duplicates, posts older than a day, your own posts, blocked authors, muted keywords — all gone.
- Scoring — The ML model predicts how likely you are to engage with each surviving post.
- Selection — Sort by score, take the top K.
- Post-Selection Filters — Final safety checks. Spam? Violence? Multiple branches of the same thread? Removed.
What comes out the other end is your feed. The whole thing happens in under a second.
Think of it like a restaurant. Stage 1 reads your preferences. Stage 2 pulls ingredients from two different pantries. Stages 3–4 prep and clean. Stage 5 is the chef cooking. Stages 6–7 plate and do a final taste test. You just see the finished dish.
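If you prefer seeing structure as code, here is a minimal Python sketch of those seven stages. Everything in it is illustrative: the real Home Mixer is Rust and gRPC, and the helper functions below are stubs standing in for real services.

```python
from dataclasses import dataclass
import random

@dataclass
class Candidate:
    post_id: int
    author_id: int
    age_hours: float
    score: float = 0.0

# --- stubs; the real versions are gRPC calls and model inference ---
def fetch_history(user_id):
    return [("like", 101), ("reply", 202)]          # recent engagements

def thunder_source(user_id):
    return [Candidate(i, i % 5, random.uniform(0, 48)) for i in range(20)]

def phoenix_retrieval(history):
    return [Candidate(100 + i, 7 + i, random.uniform(0, 48)) for i in range(20)]

def hydrate(c):
    return c                                        # would attach text, media, author info

def phoenix_rank(history, c):
    return random.random()                          # would run the ranking transformer

def build_feed(user_id: int, top_k: int = 10) -> list[Candidate]:
    history = fetch_history(user_id)                              # 1. Query Hydration
    cands = thunder_source(user_id) + phoenix_retrieval(history)  # 2. Candidate Sourcing
    cands = [hydrate(c) for c in cands]                           # 3. Candidate Hydration
    cands = [c for c in cands if c.age_hours < 24.0]              # 4. Pre-Scoring Filters
    for c in cands:                                               # 5. Scoring
        c.score = phoenix_rank(history, c)
    ranked = sorted(cands, key=lambda c: c.score, reverse=True)   # 6. Selection
    return ranked[:top_k]                                         # 7. Post-Selection (elided here)

print(build_feed(user_id=1)[:3])
```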
The two pantries: Thunder and Phoenix Retrieval
Stage 2 is where things get interesting. Your feed isn't just posts from people you follow — it's a mix of in-network and out-of-network content. Two completely different systems fetch these.
Thunder — your followees' posts, served from RAM
Thunder is an in-memory store written in Rust. It listens to a Kafka topic that streams every new post, reply, and repost on X. It partitions them by author, keeping per-author hashmaps with separate queues for original posts, replies, and videos, each with a configurable retention window (typically about a day).
When your feed request arrives, Thunder looks up every account you follow and returns their recent posts. No database queries. No disk I/O. Pure in-memory lookups, sub-millisecond.
It's like having a personal assistant who already knows what your friends said today because they've been listening the entire time. You don't ask a database — you ask someone with perfect short-term memory.
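Here's a toy version of the idea in Python, with a plain `ingest` call standing in for the real Kafka consumer. The parts that mirror Thunder are the per-author queues and the time-based eviction:

```python
import time
from collections import defaultdict, deque

RETENTION_SECS = 24 * 3600  # roughly a day, like Thunder's retention window

class ToyThunder:
    """In-memory post store: per-author queues, time-based eviction."""
    def __init__(self):
        self.posts_by_author = defaultdict(deque)  # author_id -> (timestamp, post)

    def ingest(self, author_id: int, post: str, ts: float | None = None):
        # In the real system this is driven by a Kafka consumer.
        self.posts_by_author[author_id].append((ts or time.time(), post))

    def _evict_old(self, author_id: int):
        q, cutoff = self.posts_by_author[author_id], time.time() - RETENTION_SECS
        while q and q[0][0] < cutoff:
            q.popleft()

    def recent_posts(self, followee_ids: list[int]) -> list[str]:
        # Pure in-memory lookups: no database queries, no disk I/O.
        out = []
        for author in followee_ids:
            self._evict_old(author)
            out.extend(post for _, post in self.posts_by_author[author])
        return out

store = ToyThunder()
store.ingest(42, "hello from author 42")
print(store.recent_posts([42, 99]))  # ['hello from author 42']
```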
Phoenix Retrieval — finding posts from strangers
This is the ML-powered half. Phoenix's retrieval model uses a two-tower architecture — one of the most elegant ideas in recommendation systems.
Imagine two buildings facing each other across a courtyard. In the left building (the User Tower), your entire engagement history gets compressed into a single vector — a point in 128-dimensional space. A small transformer reads your recent likes, replies, and reposts, and outputs one embedding that represents you.
In the right building (the Candidate Tower), every post on X gets compressed into a vector in the same space. This tower uses hash-based embeddings and a 2-layer MLP with SiLU activation — simpler than the user side, because it needs to encode millions of posts efficiently.
Finding posts you'd like becomes a geometry problem: which post vectors are closest to your user vector? A dot product measures similarity. Approximate nearest-neighbor search (think FAISS) makes this fast enough to run at scale.
Two-tower retrieval is like organizing a library where every book and every reader gets a GPS coordinate. Finding your next read means looking for the nearest book to where you're standing. No need to read every title — just check the map.
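The serving-time math really is that simple. Here is the geometry in a few lines of numpy, with both towers stubbed out as random vectors (in Phoenix, the user side is a small transformer and the candidate side is a hashed-embedding MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 128                      # embedding dimension mentioned above
N_POSTS = 100_000              # toy corpus; the real one is vastly larger

# Candidate-tower output: one vector per post, precomputed offline.
post_vecs = rng.normal(size=(N_POSTS, DIM)).astype(np.float32)
post_vecs /= np.linalg.norm(post_vecs, axis=1, keepdims=True)

# User-tower output: one vector summarizing your engagement history.
user_vec = rng.normal(size=DIM).astype(np.float32)
user_vec /= np.linalg.norm(user_vec)

# Retrieval is geometry: dot-product similarity, then take the top K.
scores = post_vecs @ user_vec                   # (N_POSTS,) similarities
top = np.argpartition(-scores, 500)[:500]       # brute force here; an ANN index
top = top[np.argsort(-scores[top])]             # (e.g. FAISS) does this at scale
print(top[:5], scores[top[:5]])
```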
Together, Thunder and Phoenix Retrieval produce roughly 1,000–2,000 raw candidates. That's a lot of posts. Most of them won't survive the next four stages.
The chef: Phoenix's ranking transformer
Here's where the real magic happens. Those ~1,500 candidates need to be ranked — and this is where xAI pulls out the big guns.
The ranking model is a transformer derived from Grok-1, xAI's large language model. But instead of generating text, it predicts what you'll do with a post. Specifically, it outputs probabilities for 15 different actions:
- P(like), P(reply), P(repost), P(quote)
- P(click), P(profile_click), P(share)
- P(video_view), P(photo_expand), P(dwell)
- P(follow_author)
- P(not_interested), P(block_author), P(mute_author), P(report)
Notice the last four. The model doesn't just predict what you'll like — it predicts what you'll hate. Posts that score high on P(block) or P(not_interested) get pushed down hard.
The final relevance score is dead simple — a weighted sum:
Fig 1 — The scoring formula. Elegantly simple on the surface, brutally effective underneath.
The weights are manually configured — not learned. That's a deliberate design choice. It lets engineers directly control how much a "like" matters versus a "repost" versus a "block." You can imagine the knob-turning sessions in the xAI office.
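In code, the score is just `score(post) = sum_i w_i * P(action_i | user, post)`. Here is a sketch with made-up weights; the real hand-tuned values live in Home Mixer's config and are certainly different:

```python
# Hypothetical weights -- illustrative only, not the repo's actual values.
WEIGHTS = {
    "like": 1.0, "reply": 13.5, "repost": 2.0, "quote": 2.0,
    "profile_click": 1.0, "video_view": 0.5, "dwell": 0.3,
    "follow_author": 4.0,
    # Negative actions push a post's score down hard.
    "not_interested": -20.0, "block_author": -100.0,
    "mute_author": -50.0, "report": -200.0,
}

def relevance_score(probs: dict[str, float]) -> float:
    """Weighted sum over the model's predicted action probabilities."""
    return sum(WEIGHTS[action] * p for action, p in probs.items() if action in WEIGHTS)

print(relevance_score({"like": 0.3, "reply": 0.02, "block_author": 0.001}))
# 0.3*1.0 + 0.02*13.5 + 0.001*(-100) = 0.47
```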
The trick that makes it all work: candidate isolation
This is the part that made me stop scrolling.
In a normal transformer, every token attends to every other token. If you batch 64 candidate posts into one forward pass, each post's representation would be influenced by the other 63 posts in the batch. That's a problem — it means a post's score could change depending on what other posts happen to be in the same batch.
Phoenix solves this with candidate isolation — a custom attention mask that enforces a simple rule:
Fig 2 — Each candidate sees only the user context and itself. Never other candidates.
Candidates can attend to the user token and the entire engagement history — but they cannot attend to each other. Each candidate only sees itself and your context. The attention mask is block-diagonal below the separator.
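Here is what such a mask looks like in numpy. The token layout (user context first, then one slot per candidate) is an assumption for illustration; the repo's actual implementation lives in Phoenix's attention code:

```python
import numpy as np

def candidate_isolation_mask(n_context: int, n_candidates: int) -> np.ndarray:
    """Boolean attention mask: True means the query token may attend to the key token."""
    n = n_context + n_candidates
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_context] = True       # every token attends to the user context
    idx = np.arange(n_context, n)
    mask[idx, idx] = True            # each candidate also attends to itself
    return mask                      # candidates never attend to each other

print(candidate_isolation_mask(n_context=3, n_candidates=4).astype(int))
```

Notice that the candidate-versus-candidate corner of the mask is an identity matrix: that's the block-diagonal structure below the separator, and it's why shuffling the batch can never change a candidate's score.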
Why does this matter? Three reasons:
- Deterministic scores. A post's score depends only on its content and your history — never on which other posts happen to be in the batch. Shuffle the candidates, get the same scores.
- Cacheability. If a post's score doesn't depend on its neighbors, you can cache it. Score a post once, reuse it across multiple feed requests until the user's history changes.
- No ordering effects. Without isolation, putting a controversial post next to a wholesome one could change both their scores. That's unpredictable and hard to debug. Isolation kills that problem entirely.
The trade-off? You lose cross-candidate signals. If showing contrasting posts together would be more engaging, the model can't learn that. But xAI decided stability and cacheability were worth more than whatever marginal signal cross-attention might provide.
It's like grading exams. If you grade each student's paper in isolation, every score is fair and reproducible. If you grade them side-by-side, the B+ paper next to the A+ paper starts looking worse than it is. Candidate isolation is the fair grading policy.
No features, no problem?
Traditional recommendation systems are obsessed with feature engineering. "This post has an image → boost by 1.2x." "This author has high engagement rate → multiply by 1.5x." "Post was published in the last hour → freshness bonus." Engineers hand-craft hundreds of these signals.
X's system throws all of that away.
The Grok-based transformer gets raw hash embeddings (author ID hashed into a fixed-size table, post content hashed similarly) and your engagement sequence. That's it. No "this post has a video" feature. No "author follower count" feature. The transformer learns whatever signals matter directly from the data.
This is a radical simplification. It means you don't need a team of feature engineers maintaining hundreds of data pipelines. You don't need to argue about whether "time since post" should be linear or logarithmic. The model just… figures it out.
The risk? If the model doesn't have enough capacity or data, it might miss signals that a human engineer would have caught. But at X's scale — hundreds of millions of users generating billions of engagement signals daily — data is not the bottleneck.
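To make the hash-embedding idea concrete, here is a minimal sketch, assuming a single hash function and one shared table (real systems often use several hashes to soften collisions):

```python
import numpy as np
import zlib

TABLE_SIZE = 2 ** 16           # fixed-size embedding table; IDs hash into rows
DIM = 64
rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(TABLE_SIZE, DIM)).astype(np.float32)

def embed_id(raw_id: str) -> np.ndarray:
    # Any identifier (author ID, post ID) maps to a row; collisions are
    # tolerated, and the model learns around them during training.
    row = zlib.crc32(raw_id.encode()) % TABLE_SIZE
    return table[row]

author_vec = embed_id("author:12345")
post_vec = embed_id("post:67890")
print(np.concatenate([author_vec, post_vec]).shape)  # (128,) slice of model input
```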
The architecture in numbers
The Phoenix README provides a mini reference configuration for the ranking model:
Fig 3 — The "mini" config ships with the repo. Production likely runs much larger.
This is the demo configuration. The production model almost certainly uses larger embeddings, more layers, and longer history sequences. But the architecture is the same — what changes is the scale, not the design.
The filter gauntlet
Before and after scoring, candidates run through a gauntlet of filters. These aren't glamorous, but they're essential. A recommendation system that surfaces spam, blocked authors, or posts you've already seen is a broken one.
Pre-scoring filters (applied before the model runs):
- `DropDuplicatesFilter` — same post ID appears twice? Kill the duplicate.
- `AgeFilter` — older than ~24 hours? Out.
- `SelfpostFilter` — your own post? You already know about it.
- `AuthorSocialgraphFilter` — blocked or muted author? Gone.
- `MutedKeywordFilter` — contains a word you muted? Invisible.
- `PreviouslySeenPostsFilter` — you already scrolled past it? Don't show again.
- `IneligibleSubscriptionFilter` — paywalled and you're not subscribed? Skip.
Post-selection filters (applied after ranking, on the final set):
- `VFFilter` — visibility filter; catches posts flagged as spam, violence, or gore after they were scored.
- `DedupConversationFilter` — prevents showing three different branches of the same thread.
The two-phase approach is smart. Running ML scoring is expensive, so you want to eliminate obviously ineligible posts before sending them to the model. But some checks (like visibility flags) might update after scoring, so you need a final safety net.
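The shape of that gauntlet is easy to sketch in Python. The filter names in the comments mirror the repo; the toy implementations are mine, not X's:

```python
from dataclasses import dataclass

@dataclass
class Post:
    post_id: int
    author_id: int
    age_hours: float
    seen_before: bool = False

def drop_duplicates_filter():
    seen = set()
    def keep(p: Post) -> bool:                 # DropDuplicatesFilter
        if p.post_id in seen:
            return False
        seen.add(p.post_id)
        return True
    return keep

def prescoring_filters(user_id: int, blocked: set[int]):
    # Each filter is a predicate: True means the candidate survives.
    return [
        drop_duplicates_filter(),
        lambda p: p.age_hours < 24,            # AgeFilter
        lambda p: p.author_id != user_id,      # SelfpostFilter
        lambda p: p.author_id not in blocked,  # AuthorSocialgraphFilter
        lambda p: not p.seen_before,           # PreviouslySeenPostsFilter
    ]

def run(posts: list[Post], filters) -> list[Post]:
    return [p for p in posts if all(f(p) for f in filters)]

posts = [Post(1, 7, 2.0), Post(1, 7, 2.0), Post(2, 99, 30.0), Post(3, 5, 1.0, seen_before=True)]
print(run(posts, prescoring_filters(user_id=8, blocked={42})))  # only post 1 survives
```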
The diversity problem
Without intervention, a pure relevance-ranked feed would show you 10 posts from your favorite account, followed by 10 more from your second favorite. Technically optimal. Practically boring.
The Author Diversity Scorer handles this. When multiple top-ranked posts come from the same author, it applies an attenuation penalty — each successive post from the same person gets a lower multiplier. Your feed stays varied without sacrificing overall relevance.
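Here's a minimal sketch of attenuation, assuming a fixed geometric decay per repeated author (a hypothetical schedule, not the repo's actual multipliers):

```python
from collections import defaultdict

DECAY = 0.7  # hypothetical: each repeat from the same author keeps 70% of its score

def apply_author_diversity(ranked: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """ranked: (author_id, score) pairs, already sorted by score descending."""
    seen_count = defaultdict(int)
    rescored = []
    for author, score in ranked:
        rescored.append((author, score * (DECAY ** seen_count[author])))
        seen_count[author] += 1
    # Re-sort after attenuation so other authors can surface.
    return sorted(rescored, key=lambda x: x[1], reverse=True)

ranked = [("alice", 0.9), ("alice", 0.8), ("alice", 0.7), ("bob", 0.6)]
print(apply_author_diversity(ranked))
# alice's 2nd and 3rd posts drop to 0.56 and 0.343; bob jumps ahead of them.
```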
There's also an OON (Out-of-Network) Scorer that can boost or adjust scores for posts from accounts you don't follow. This ensures your feed isn't just an echo chamber of your existing network — it actively injects discovery.
The code: Rust + Python, microservices not monolith
The repo is split into five major components:
- Home Mixer (Rust) — the orchestrator. Implements a gRPC server, runs the 7-stage pipeline, plugs together sources/hydrators/filters/scorers using the candidate-pipeline framework.
- Thunder (Rust) — the in-memory post store. Kafka consumer, per-author hashmaps, zero database queries at serving time.
- Phoenix (Python/JAX) — the ML brain. Two-tower retrieval + Grok-based transformer ranking. Uses JAX and Haiku.
- Candidate Pipeline (Rust crate) — the reusable framework. Defines traits for Source, Hydrator, Filter, Scorer, Selector, and SideEffect. Handles parallelism and fault tolerance.
- GroX (Python) — content understanding. Classifiers, embedders, spam detection, content categorization. Feeds features into Phoenix's input pipeline.
The language split is deliberate. Rust handles the hot path — serving requests, filtering candidates, orchestrating the pipeline — where latency matters down to microseconds. Python/JAX handles the ML — where expressiveness and GPU acceleration matter more than raw serving speed.
Running it yourself
The repo includes a runnable demo. Here's the gist:
- Download the pre-trained artifacts via Git LFS (~3 GB — model checkpoints + a sports news corpus of 537k posts with precomputed embeddings).
- Run the pipeline: `uv run run_pipeline.py --artifacts_dir artifacts/oss-phoenix-artifacts`. It loads the models, takes an example user history, retrieves the top 200 posts from the corpus, ranks them, and prints predicted probabilities and final scores.
You can edit `example_sequence.json` to change the user history, or use the `--top_k_retrieval` and `--top_k_display` flags to control how many candidates get retrieved and displayed.
Fair warning: the Rust components don't ship with a complete Cargo.toml setup, so building the full serving infrastructure requires some manual assembly. This is a reference implementation, not a Docker-compose-and-go deployment.
What's missing (and what's honest about that)
Let's be real about the limitations:
- No benchmarks. The repo doesn't publish accuracy, AUC, or any engagement metrics. It's a codebase release, not a research paper. You can run the demo and see ranked outputs, but there's no "we improved CTR by X%."
- No fairness controls. There's no explicit debiasing in the code. The model learns from engagement data, which means it can amplify whatever biases exist in how people interact with content.
- Cold start is cold. New users with no history and new posts with no engagement signals rely entirely on hash embeddings — which is basically random. There's no specialized cold-start model.
- Candidate isolation loses signal. By design, the model can't learn that showing diverse posts together might be more engaging. Cross-post context is sacrificed for determinism.
- Latency at scale. Running a Grok-style transformer on hundreds of candidates per request isn't cheap. The mini config is small; production models are almost certainly larger and more expensive.
These aren't criticisms — they're engineering trade-offs. Every system makes them. What's rare is being able to see them.
The one-paragraph version
X's "For You" feed is a seven-stage pipeline. Thunder (in-memory Rust store) provides posts from your network; Phoenix Retrieval (two-tower ML model) finds posts from strangers. Both get ranked by a Grok-based transformer that predicts 15 engagement probabilities per post — likes, replies, reposts, but also blocks and mutes. Scores are a weighted sum of those probabilities. A custom attention mask (candidate isolation) ensures each post's score is independent of its batch neighbors. The whole system replaces hand-engineered features with end-to-end learning. No manual rules. Just a transformer, your history, and a lot of Rust.
The repo is at github.com/xai-org/x-algorithm. 18.8k stars. Rust + Python. Apache 2.0 license. Go read it.