// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · beginner · 9 min read · v1

Firecrawl — turn any website into clean LLM-ready markdown.

TL;DR: Firecrawl is a web scraping API that crawls websites and returns clean markdown, structured data, or screenshots optimized for LLM consumption. It handles JavaScript rendering, anti-bot measures, and sitemaps. Use it as a hosted API or self-host it. One of the quickest ways to feed web content to your agent.

What it is

Firecrawl is an API service (and open-source self-hostable project) that takes a URL and returns clean, LLM-friendly content. It renders JavaScript, strips boilerplate (nav, ads, footers), and outputs markdown, HTML, or structured JSON. Think of it as a "URL → clean text" pipeline purpose-built for feeding content into language models and RAG pipelines.

Why it exists

Raw web scraping is painful: JavaScript-heavy SPAs don't return content with simple HTTP requests, anti-bot measures block scrapers, and extracting the actual content from a page full of nav bars and cookie banners requires constant maintenance. Firecrawl handles all of this behind an API call so your agent can focus on reasoning, not parsing HTML.

Install & setup

pip install firecrawl-py
export FIRECRAWL_API_KEY=fc-...

Get an API key from firecrawl.dev. Free tier available. For self-hosting, clone the repo and run via Docker.

Basic scrape

Scrape a single page and get markdown back:

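import os

from firecrawl import FirecrawlApp

# Read the key exported during setup instead of hard-coding it.
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Dict-style options as shown here; newer SDK releases may expect keyword
# arguments instead, so check the version you installed.
result = app.scrape_url("https://example.com/blog/post-1", {
    "formats": ["markdown"]
})
print(result["markdown"])  # cleaned article text as markdown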

The returned markdown is cleaned — no nav, no ads, just the article content. Ready to drop into an LLM prompt or vector store.
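Before the markdown goes into a vector store, it usually needs to be split into chunks. A minimal sketch, assuming the markdown is already in a string: `chunk_markdown` and `max_chars` are hypothetical names for illustration and are not part of the Firecrawl SDK.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split markdown at headings, then cap each chunk at max_chars.

    Hypothetical helper for illustration; not part of the Firecrawl SDK.
    """
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Hard-split oversized sections so no chunk exceeds max_chars.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


# Made-up sample input standing in for result["markdown"].
sample = "# Title\n\nIntro text.\n\n## Details\n\nMore text here.\n"
chunks = chunk_markdown(sample)
```

Splitting on headings keeps each chunk topically coherent. Note that this simple pattern also splits on `# ` lines inside code fences, so a production splitter would need to be fence-aware.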

Crawl an entire site

Crawl follows links and scrapes multiple pages:

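# Reuses the app client created in the basic scrape example.
crawl_result = app.crawl_url("https://docs.example.com", {
    "limit": 50,  # stop after at most 50 pages
    "scrapeOptions": {"formats": ["markdown"]}  # output format for each page
})

# Each entry in "data" is one scraped page.
for page in crawl_result["data"]:
    print(page["metadata"].get("title", "(untitled)"))  # some pages have no title
    print(page["markdown"][:200])  # first 200 characters as a preview
    print("---")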

Firecrawl respects robots.txt, handles sitemaps, and deduplicates pages. Set limit to control how many pages to crawl.
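Crawl output usually gets normalized before indexing. The sketch below assumes the result has the shape used in the loop above (a "data" list whose entries carry "markdown" and "metadata" with a "title"); `pages_to_records` is a hypothetical helper, and the sample dict is made up for illustration.

```python
def pages_to_records(crawl_result: dict) -> list[dict]:
    """Flatten a crawl result into simple {"title", "text"} records.

    Hypothetical helper; assumes the "data" / "markdown" / "metadata"
    layout shown in the crawl example.
    """
    records = []
    for page in crawl_result.get("data", []):
        markdown = (page.get("markdown") or "").strip()
        if not markdown:
            continue  # skip pages that came back with no content
        title = (page.get("metadata") or {}).get("title") or "(untitled)"
        records.append({"title": title, "text": markdown})
    return records


# Made-up sample standing in for a real crawl_result.
sample_result = {
    "data": [
        {"markdown": "# Intro\n\nWelcome.", "metadata": {"title": "Intro"}},
        {"markdown": "", "metadata": {"title": "Empty page"}},
        {"markdown": "# Setup\n\nInstall it.", "metadata": {}},
    ]
}
records = pages_to_records(sample_result)
```

Using `.get()` throughout keeps one malformed page from aborting the whole batch.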

Extract structured data

Use LLM extraction to pull structured fields from pages:

result = app.scrape_url("https://example.com/pricing", {
    "formats": ["extract"],
    "extract": {
        "schema": {
            "type": "object",
            "properties": {
                "plans": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                            "features": {"type": "array", "items": {"type": "string"}}
                        }
                    }
                }
            }
        }
    }
})
print(result["extract"])
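Because the fields are filled in by an LLM, it is worth checking the extracted data before relying on it. A minimal sketch, assuming the extract matches the schema above (a "plans" array of objects with "name", "price", and "features"): `Plan` and `parse_plans` are hypothetical names, and the sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    name: str
    price: str
    features: list[str] = field(default_factory=list)


def parse_plans(extracted: dict) -> list[Plan]:
    """Turn the raw extract dict into Plan objects, dropping unusable entries.

    Hypothetical helper; the schema above marks no field as required,
    so any field may be missing.
    """
    plans = []
    for item in extracted.get("plans") or []:
        if not isinstance(item, dict) or not item.get("name"):
            continue  # a plan without a name is not useful downstream
        plans.append(Plan(
            name=str(item["name"]),
            price=str(item.get("price", "")),
            features=[str(f) for f in item.get("features") or []],
        ))
    return plans


# Made-up sample standing in for result["extract"].
sample_extract = {
    "plans": [
        {"name": "Starter", "price": "$10/month", "features": ["Feature A"]},
        {"price": "$99/month"},  # missing name, so it is dropped
    ]
}
plans = parse_plans(sample_extract)
```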

When to use, when to skip

Use it when your agent needs to read web pages, build RAG pipelines from websites, or extract structured data from the web. Especially valuable for JS-heavy sites that requests.get() can't handle.

Skip it when you only need simple static pages (use requests + BeautifulSoup), when you need real-time browser interaction (use Browser Use or Selenium), or when API cost is a concern and a self-hosted tool covers your needs.
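The static versus JS-heavy distinction can be checked roughly before choosing a tool. A heuristic sketch, assuming you have already fetched the raw HTML (for example with requests.get): it counts visible text outside script and style tags. `looks_static` and the 200-character threshold are hypothetical choices for illustration, not a Firecrawl feature.

```python
from html.parser import HTMLParser

_SKIPPED_TAGS = ("script", "style", "noscript")


class _TextCounter(HTMLParser):
    """Counts characters of visible text, ignoring script/style content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def looks_static(html: str, min_text_chars: int = 200) -> bool:
    """Return True when the raw HTML already contains substantial text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.text_chars >= min_text_chars


# Made-up samples: an empty SPA shell and a server-rendered article.
spa_shell = '<html><body><div id="root"></div><script>renderApp()</script></body></html>'
article = (
    "<html><body><article>"
    + "<p>Plain server-rendered paragraph.</p>" * 10
    + "</article></body></html>"
)
```

A page that fails this check is a candidate for a JS-rendering tool; one that passes can likely be handled with requests + BeautifulSoup.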

vs the alternatives

Tool                | Best for                                 | Trade-off
Firecrawl           | Clean markdown from any URL, site crawls | API cost, hosted dependency
Jina Reader         | Free single-page URL→text                | No crawling, less control
BeautifulSoup       | Simple static HTML parsing               | No JS rendering, manual cleanup
Playwright/Selenium | Full browser interaction                 | Heavy, requires scripting
Crawl4AI            | Open-source LLM-ready crawling           | Self-host only, less polished

Verified against Firecrawl docs (docs.firecrawl.dev), May 2026.

© cvam — written in plaintext, served warm