TL;DR: Firecrawl is a web scraping API that crawls websites and returns clean markdown, structured data, or screenshots suited to LLM consumption. It handles JavaScript rendering, anti-bot measures, and sitemaps, and runs either as a hosted API or self-hosted. A quick way to feed web content to your agent.
What it is
Firecrawl is an API service (and open-source self-hostable project) that takes a URL and returns clean, LLM-friendly content. It renders JavaScript, strips boilerplate (nav, ads, footers), and outputs markdown, HTML, or structured JSON. Think of it as a "URL → clean text" pipeline purpose-built for feeding content into language models and RAG pipelines.
Why it exists
Raw web scraping is painful: JavaScript-heavy SPAs don't return content with simple HTTP requests, anti-bot measures block scrapers, and extracting the actual content from a page full of nav bars and cookie banners requires constant maintenance. Firecrawl handles all of this behind an API call so your agent can focus on reasoning, not parsing HTML.
Install & setup
pip install firecrawl-py
export FIRECRAWL_API_KEY=fc-...
Get an API key from firecrawl.dev. Free tier available. For self-hosting, clone the repo and run via Docker.
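Since the key is exported as FIRECRAWL_API_KEY above, it can be read from the environment rather than pasted into source files. A minimal sketch, assuming a hypothetical helper named load_firecrawl_key (not part of the Firecrawl SDK) that fails early with a readable message when the variable is missing:

```python
import os
from typing import Mapping, Optional


def load_firecrawl_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Firecrawl API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set; export it before running.")
    return key
```

The result can then be passed to the client, for example FirecrawlApp(api_key=load_firecrawl_key()).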
Basic scrape
Scrape a single page and get markdown back:
from firecrawl import FirecrawlApp

# Create a client with your API key
app = FirecrawlApp(api_key="fc-...")

# Request the page as markdown only
result = app.scrape_url("https://example.com/blog/post-1", {
    "formats": ["markdown"]
})
print(result["markdown"])
The returned markdown is cleaned: no nav, no ads, just the article content. It is ready to drop into an LLM prompt or vector store.
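Before the markdown goes into a vector store it usually needs to be split into chunks. The sketch below is one simple approach, not something Firecrawl provides: a hypothetical chunk_markdown helper that splits at heading lines and then caps chunk length. The max_chars default is an arbitrary assumption to tune for your embedding model.

```python
import re
from typing import List


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split markdown into chunks at heading lines, then cap each chunk's length."""
    # Split just before every ATX heading (#, ##, ...), so each heading
    # stays attached to the section that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Hard-slice oversized sections so no chunk exceeds max_chars.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Calling chunk_markdown(result["markdown"]) on the scrape output above would give a list of strings to embed one by one. Slicing by character count can cut a sentence in half, so a production pipeline would likely split on paragraph boundaries instead.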
Crawl an entire site
Crawl follows links and scrapes multiple pages:
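# Crawl up to 50 pages, returning each as markdown
crawl_result = app.crawl_url("https://docs.example.com", {
    "limit": 50,
    "scrapeOptions": {"formats": ["markdown"]}
})

# Print the title and a 200-character preview of every crawled page
for page in crawl_result["data"]:
    print(page["metadata"]["title"])
    print(page["markdown"][:200])
    print("---")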
Firecrawl respects robots.txt, handles sitemaps, and deduplicates pages. Set limit to control how many pages to crawl.
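For a RAG pipeline, the crawl output typically gets flattened into simple records before indexing. A minimal sketch, assuming the result shape used in the loop above (a "data" list whose items carry "markdown" and "metadata") and assuming the page URL lives under a metadata key named sourceURL. The pages_to_records helper is hypothetical, and the URL check is only a defensive guard on top of whatever deduplication Firecrawl already does.

```python
from typing import Any, Dict, List


def pages_to_records(crawl_result: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten a crawl result into {title, url, text} records for indexing."""
    records: List[Dict[str, str]] = []
    seen_urls = set()
    for page in crawl_result.get("data") or []:
        markdown = (page.get("markdown") or "").strip()
        if not markdown:
            continue  # skip pages that came back empty
        metadata = page.get("metadata") or {}
        url = metadata.get("sourceURL") or ""  # assumed key name
        if url and url in seen_urls:
            continue  # defensive: drop repeated URLs
        seen_urls.add(url)
        records.append({
            "title": metadata.get("title") or "(untitled)",
            "url": url,
            "text": markdown,
        })
    return records
```

Using .get throughout means a page with missing metadata becomes an "(untitled)" record instead of raising KeyError partway through a long crawl.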
Extract structured data
Use LLM extraction to pull structured fields from pages:
result = app.scrape_url("https://example.com/pricing", {
    "formats": ["extract"],
    "extract": {
        "schema": {
            "type": "object",
            "properties": {
                "plans": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                            "features": {"type": "array", "items": {"type": "string"}}
                        }
                    }
                }
            }
        }
    }
})
print(result["extract"])
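LLM-based extraction can return partial or oddly shaped data, so it may be worth normalizing the result before relying on it. The sketch below mirrors the pricing schema above with a dataclass. Plan and parse_plans are hypothetical names for illustration, not part of Firecrawl, and the rule of dropping entries without a name is an assumption you may want to relax.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Plan:
    name: str
    price: str
    features: List[str] = field(default_factory=list)


def parse_plans(extract: Dict[str, Any]) -> List[Plan]:
    """Convert the raw 'extract' payload into Plan objects, skipping malformed entries."""
    plans: List[Plan] = []
    for raw in extract.get("plans") or []:
        # Ignore anything that is not an object or has no plan name.
        if not isinstance(raw, dict) or not raw.get("name"):
            continue
        plans.append(Plan(
            name=str(raw["name"]),
            price=str(raw.get("price") or ""),
            features=[str(f) for f in raw.get("features") or []],
        ))
    return plans
```

With the scrape above, parse_plans(result["extract"]) would yield a list of Plan objects, or an empty list if the model found no plans.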
When to use, when to skip
Use it when your agent needs to read web pages, build RAG pipelines from websites, or extract structured data from the web. Especially valuable for JS-heavy sites that requests.get() can't handle.
Skip it when you only need simple static pages (use requests + BeautifulSoup), when you need real-time browser interaction (use Browser Use or Selenium), or when cost is a concern and you can self-host alternatives.
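One rough way to decide between the two paths is to fetch the page with a plain HTTP request first and check how much visible text the static HTML contains. The sketch below is a heuristic, not a Firecrawl feature: needs_js_rendering and its 200-character threshold are assumptions, and some pages will be misclassified.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters, ignoring script, style, and noscript content."""

    _SKIPPED = ("script", "style", "noscript")

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def needs_js_rendering(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether a page is a JS-rendered shell: True when static HTML has little text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.text_chars < min_text_chars
```

If the function returns False, the static HTML probably already carries the content and requests + BeautifulSoup may be enough. If it returns True, the page is a candidate for a rendering scraper such as Firecrawl.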
vs the alternatives
| Tool | Best for | Trade-off |
|---|---|---|
| Firecrawl | Clean markdown from any URL, site crawls | API cost, hosted dependency |
| Jina Reader | Free single-page URL→text | No crawling, less control |
| BeautifulSoup | Simple static HTML parsing | No JS rendering, manual cleanup |
| Playwright/Selenium | Full browser interaction | Heavy, requires scripting |
| Crawl4AI | Open-source LLM-ready crawling | Self-host only, less polished |
Verified against Firecrawl docs (docs.firecrawl.dev), May 2026.