How to Scrape Websites for AI Training Data & RAG Pipelines with Python
Learn how to scrape clean, structured web data for AI training and RAG pipelines using Python, Crawl4AI, and Firecrawl. Step-by-step 2026 guide with working code.
Senior Developer

Why Every AI Developer Needs to Know How to Scrape the Web
Here's the uncomfortable truth about building AI applications in 2026: the quality of your output is only as good as the quality of your input data.
Pre-trained models like GPT-4 or Claude have a knowledge cutoff. They don't know about the paper published last Tuesday, the product launched last month, or the price change that happened this morning. If your AI application needs current, domain-specific, or proprietary knowledge, you have two main options:
1. Fine-tune the model on your own data (expensive, slow, requires thousands of examples)
2. Use Retrieval-Augmented Generation (RAG): give the model access, at query time, to scraped, structured knowledge
Option 2 is almost always the right answer. And web scraping is the engine that powers it.
According to Zyte's 2026 Web Scraping Industry Report, AI-powered code generation, LLM-based extraction, and intelligent browser automation are compressing development cycles dramatically — and a growing share of scraping pipelines now feed directly into LLM workflows.
The web scraping software market sits at $1.17 billion in 2026, growing at an 18.5% CAGR — and the AI-powered data extraction segment specifically is projected at $7.48 billion, trending toward $38.44 billion. Web scraping isn't adjacent to the AI wave. It's riding it.
This guide shows you exactly how to build a complete pipeline: scrape web content → clean it → embed it → store it in a vector database → query it with an LLM.
What Is a RAG Pipeline and Why Does It Need Web Data?
Large language models are limited by their training data. They don't know about the documentation update published yesterday, the product released this morning, or the article posted five minutes ago.
Retrieval-Augmented Generation (RAG) solves this by allowing an LLM to retrieve relevant information from an external knowledge base before generating a response. Instead of answering purely from memory, the model answers using information retrieved at query time.
User Question
↓
Similarity Search → Vector Database → Retrieve Relevant Chunks
↓
LLM (GPT-4 / Claude / Llama) + Context
↓
Grounded, Accurate Answer
The quality of a RAG system depends heavily on the quality of the data inside that knowledge base. Documentation sites, research papers, blogs, news articles, and internal company content all need to be collected and kept fresh, which is where web scraping comes in.
The challenge is that raw HTML contains a huge amount of noise: navigation menus, cookie banners, footers, advertisements, and tracking scripts. Modern AI-focused scrapers remove that noise and produce clean, structured content that can be embedded, stored in vector databases, and retrieved efficiently by LLMs.
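To make the noise problem concrete, here is a minimal sketch of boilerplate stripping using only the standard library. It is not how Crawl4AI or Firecrawl work internally; the `NOISE_TAGS` list, the `MainTextExtractor` class, and the sample HTML are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate (an assumed, non-exhaustive list).
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style", "noscript", "form"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping everything inside NOISE_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a noise element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self) -> str:
        return "\n".join(self._parts)


def strip_noise(html: str) -> str:
    """Return only the text found outside navigation, footers, scripts, and similar tags."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = """
<html><body>
  <nav><a href="/">Home</a> <a href="/docs">Docs</a></nav>
  <main><h1>asyncio</h1><p>asyncio is a library to write concurrent code.</p></main>
  <footer>Copyright 2026</footer>
  <script>trackVisit();</script>
</body></html>
"""
print(strip_noise(sample_html))
```

Only the heading and the paragraph inside `<main>` survive. Real pages are messier (cookie banners are often plain `<div>` elements), which is why dedicated tools add heuristics on top of tag filtering.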
The AI Scraping Stack in 2026
Before we write code, here are the tools we'll use and why:
| Tool | Role | Why It Matters for AI |
|---|---|---|
| Crawl4AI | Open-source crawler optimised for RAG | Returns clean Markdown, handles JS, free |
| Firecrawl | Managed API for LLM-ready content | Zero infra, Markdown output, JS rendering |
| ScrapeGraphAI | LLM-powered extraction via natural language | No CSS selectors, adapts to layout changes |
| ChromaDB | Local vector database | Stores and queries embeddings |
| sentence-transformers | Embedding model | Converts text chunks to vectors |
| LangChain | RAG orchestration | Connects scrapers, embeddings, and LLMs |
Part 1: Crawl4AI — The Open-Source RAG Scraper
Crawl4AI is an open-source Python crawler built specifically for RAG pipelines. It generates clean Markdown optimised for RAG with BM25-based content filtering, supports LLM-powered structured extraction, and handles full-site crawling with link following and depth control — with no per-request costs.
This makes it the go-to choice for teams who want full control without paying per-request fees.
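BM25-based filtering is easier to reason about with a small example. The sketch below scores text blocks against a query with the standard Okapi BM25 formula and keeps the blocks that match. It illustrates the ranking idea only and is not Crawl4AI's implementation; the function names, the `k1` and `b` defaults, and the sample blocks are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, blocks: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each text block against the query with Okapi BM25."""
    docs = [tokenize(block) for block in blocks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many blocks does each term appear?
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation: long blocks need more matches for the same score
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


blocks = [
    "Home Docs Blog Pricing Login",
    "The asyncio event loop runs asynchronous tasks and callbacks.",
    "Accept all cookies to continue browsing this site.",
]
scores = bm25_scores("asyncio event loop", blocks)
kept = [block for block, score in zip(blocks, scores) if score > 0]
print(kept)
```

In this toy input the navigation and cookie blocks share no terms with the query, so they score zero and are dropped. A real filter would use a threshold rather than "greater than zero".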
Installation
```bash
pip install crawl4ai
crawl4ai-setup  # Downloads browser binaries (required the first time)
```
Basic usage: scrape a single page to clean Markdown
```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def scrape_to_markdown(url: str) -> str:
    """
    Scrape a URL and return clean, LLM-ready Markdown.
    Crawl4AI strips navigation, footers, ads, and boilerplate automatically.
    """
    async with AsyncWebCrawler(verbose=False) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=10,       # Skip tiny elements (nav links, buttons)
            remove_overlay_elements=True,  # Remove popups, cookie banners
            process_iframes=False,
        )
        if not result.success:
            raise RuntimeError(f"Crawl failed: {result.error_message}")
        # Recent Crawl4AI releases deprecate markdown_v2 in favour of result.markdown,
        # which carries a raw_markdown attribute; older releases return a plain string.
        markdown = result.markdown
        return getattr(markdown, "raw_markdown", markdown)


# Test it
markdown = asyncio.run(scrape_to_markdown("https://docs.python.org/3/library/asyncio.html"))
print(markdown[:500])
print(f"\nTotal length: {len(markdown)} characters")
```
The output is clean Markdown with no HTML tags, navigation menus, or cookie consent text, which is exactly what an LLM needs.
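Crawls fail for transient reasons (timeouts, rate limits), and `scrape_to_markdown` raises as soon as one does. A hedged sketch of a retry wrapper with exponential backoff follows. The `with_retries` helper and its defaults are hypothetical and not part of Crawl4AI.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    fetch: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call fetch() until it succeeds, doubling the wait between attempts."""
    for attempt in range(attempts):
        try:
            return await fetch()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            # Exponential backoff plus a little jitter so parallel workers do not retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.2f}s")
            await asyncio.sleep(delay)
    raise ValueError("attempts must be at least 1")
```

It could wrap the function above as `asyncio.run(with_retries(lambda: scrape_to_markdown(url)))`. Catching every `Exception` is a simplification; a production version would retry only on errors known to be transient.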
Crawling an entire documentation site
```python
import asyncio
import json
from collections import deque
from datetime import datetime, timezone
from urllib.parse import urldefrag, urljoin, urlparse

from crawl4ai import AsyncWebCrawler


async def crawl_documentation_site(
    start_url: str,
    max_pages: int = 50,
    same_domain_only: bool = True,
) -> list[dict]:
    """
    Crawl a documentation site breadth-first and return a list of
    {'url': ..., 'title': ..., 'content': ..., 'scraped_at': ...} dicts,
    ready for RAG ingestion.
    """
    base_domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])
    pages = []
    async with AsyncWebCrawler(verbose=False) as crawler:
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            print(f"[{len(pages)+1}/{max_pages}] Crawling: {url}")
            result = await crawler.arun(
                url=url,
                word_count_threshold=15,
                remove_overlay_elements=True,
            )
            if not result.success:
                continue
            metadata = result.metadata or {}
            # markdown_v2 is deprecated in recent Crawl4AI releases; see the note in the previous example
            markdown = result.markdown
            pages.append({
                "url": url,
                "title": metadata.get("title") or "",
                "content": getattr(markdown, "raw_markdown", markdown),
                "scraped_at": datetime.now(timezone.utc).isoformat(),
            })
            # Queue newly discovered links (external ones only if allowed)
            links = list(result.links.get("internal", []) or [])
            if not same_domain_only:
                links += result.links.get("external", []) or []
            for link in links:
                # Resolve relative hrefs and drop #fragments before de-duplicating
                href, _ = urldefrag(urljoin(url, link.get("href", "")))
                if not href or href in visited:
                    continue
                if same_domain_only and urlparse(href).netloc != base_domain:
                    continue
                queue.append(href)
    print(f"\nCrawled {len(pages)} pages.")
    return pages


pages = asyncio.run(crawl_documentation_site(
    "https://docs.python.org/3/",
    max_pages=30,
))
# Save raw crawl output
with open("docs_crawl.json", "w", encoding="utf-8") as f:
    json.dump(pages, f, indent=2, ensure_ascii=False)
```
Part 2: Cleaning and Chunking for Optimal RAG Performance
Raw page content needs to be split into chunks before embedding. The chunk size is critical: too small and you lose context; too large and retrieval becomes imprecise.
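The trade-off is easier to see with a toy splitter. The sketch below cuts text into fixed character windows that overlap, and adds a rough token estimate. It is a simplified illustration, not what LangChain's splitters do (they prefer to break at separators such as blank lines). The function names and the ratio of 4 characters per token are assumptions.

```python
def chunk_with_overlap(text: str, chunk_size: int = 800, overlap: int = 80) -> list:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
    return chunks


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; 4 characters per token is an assumed average for English."""
    return round(len(text) / chars_per_token)


text = "abcdefghij" * 200  # 2,000 characters of filler
chunks = chunk_with_overlap(text)
print([len(c) for c in chunks])
print(estimate_tokens(chunks[0]))
```

With these defaults an 800-character chunk is on the order of 200 tokens, and the last 80 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still retrievable from one side.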
```python
# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter


def chunk_markdown_page(page: dict) -> list[dict]:
    """
    Split a Markdown page into semantically meaningful chunks.
    Preserve heading context in each chunk for better retrieval.
    """
    # First: split by Markdown headers to preserve semantic sections
    headers_to_split = [
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split,
        strip_headers=False,  # Keep headers in chunk text for context
    )
    # Then: split long sections by character count
    char_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,    # Characters, not tokens: roughly 200 tokens at ~4 characters per token
        chunk_overlap=80,  # 10% overlap prevents losing context at boundaries
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = []
    header_docs = header_splitter.split_text(page["content"])
    for doc in header_docs:
        section = doc.metadata.get("h2") or doc.metadata.get("h1", "")
        # If section is short enough, keep as one chunk
        if len(doc.page_content) <= 900:
            chunks.append({
                "content": doc.page_content,
                "source_url": page["url"],
                "page_title": page["title"],
                "section": section,
                "char_count": len(doc.page_content),
            })
        else:
            # Split large sections further
            sub_chunks = char_splitter.split_text(doc.page_content)
            for i, chunk_text in enumerate(sub_chunks):
                chunks.append({
                    "content": chunk_text,
                    "source_url": page["url"],
                    "page_title": page["title"],
                    "section": section,
                    "chunk_index": i,
                    "char_count": len(chunk_text),
                })
    return chunks


# Process all pages
all_chunks = []
for page in pages:
    chunks = chunk_markdown_page(page)
    all_chunks.extend(chunks)
print(f"Total chunks: {len(all_chunks)}")
if all_chunks:  # Avoid dividing by zero when the crawl returned nothing
    print(f"Average chunk size: {sum(c['char_count'] for c in all_chunks) / len(all_chunks):.0f} chars")
```
Part 3: Embedding and Storing in ChromaDB
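One design choice is worth weighing before storing anything. The storage code in this part gives every chunk a random `uuid4` ID, so ingesting the same page twice adds duplicates unless the old chunks are deleted first. A possible alternative, sketched here under assumptions, is a deterministic ID derived from the chunk's origin and content. The `stable_chunk_id` helper is hypothetical.

```python
import hashlib


def stable_chunk_id(source_url: str, chunk_index: int, content: str) -> str:
    """Derive a deterministic ID from where a chunk came from and what it says."""
    digest = hashlib.sha256()
    for part in (source_url, str(chunk_index), content):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # Separator so ("ab", "c") and ("a", "bc") hash differently
    return digest.hexdigest()[:32]


first = stable_chunk_id("https://docs.python.org/3/", 0, "asyncio is a library...")
again = stable_chunk_id("https://docs.python.org/3/", 0, "asyncio is a library...")
edited = stable_chunk_id("https://docs.python.org/3/", 0, "asyncio is a framework...")
print(first == again, first == edited)
```

Unchanged chunks keep the same ID across runs, so they can be written with ChromaDB's `collection.upsert(...)` instead of `collection.add(...)` without creating duplicates. Two identical chunks from the same page and position would collide, which is usually what you want for de-duplication but is worth knowing.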
```python
import uuid

import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Load embedding model (runs locally, no API key needed)
# "all-MiniLM-L6-v2" is fast and good; use "all-mpnet-base-v2" for higher quality
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Initialise ChromaDB (local persistent storage)
chroma_client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False),
)
collection = chroma_client.get_or_create_collection(
    name="python_docs",
    metadata={"hnsw:space": "cosine"},  # Cosine similarity for text
)


def embed_and_store(chunks: list[dict], batch_size: int = 64):
    """Embed chunks and store in ChromaDB with metadata."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["content"] for c in batch]
        print(f"Embedding batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1}...")
        embeddings = embedder.encode(texts, show_progress_bar=False).tolist()
        collection.add(
            ids=[str(uuid.uuid4()) for _ in batch],
            embeddings=embeddings,
            documents=texts,
            metadatas=[{
                "source_url": c["source_url"],
                "page_title": c["page_title"],
                "section": c.get("section", ""),
            } for c in batch],
        )
    print(f"Stored {len(chunks)} chunks in ChromaDB.")


embed_and_store(all_chunks)
print(f"Collection size: {collection.count()} documents")
```
Part 4: Querying — The RAG Retrieval Loop
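Before the ChromaDB version, it may help to see what "similarity search" computes. The sketch below ranks a handful of hand-written vectors by cosine similarity to a query vector. The three-dimensional vectors and the `rank_by_cosine` helper are made up for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions, and a vector database uses an approximate index instead of scanning every vector.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec: list, doc_vecs: dict, k: int = 2) -> list:
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_vectors = {
    "event-loop": [0.9, 0.1, 0.0],
    "file-io": [0.1, 0.9, 0.1],
    "sockets": [0.6, 0.2, 0.6],
}
ranked = rank_by_cosine([1.0, 0.0, 0.0], toy_vectors, k=2)
for doc_id, similarity in ranked:
    print(doc_id, round(similarity, 3), "distance:", round(1 - similarity, 3))
```

Cosine distance is simply one minus cosine similarity, which is why the retrieval code in this part converts ChromaDB's distances back with `1 - dist`.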
```python
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """
    Embed a query and retrieve the most relevant chunks from ChromaDB.
    """
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        retrieved.append({
            "content": doc,
            "source_url": meta["source_url"],
            "section": meta["section"],
            "similarity": round(1 - dist, 3),  # Convert cosine distance to similarity
        })
    return retrieved


# Test retrieval
results = retrieve("How does asyncio event loop work?", top_k=3)
for r in results:
    print(f"\n--- {r['section']} (similarity: {r['similarity']}) ---")
    print(r["content"][:200])
    print(f"Source: {r['source_url']}")
```
Part 5: The Full RAG Answer Pipeline
Now we connect retrieval to an LLM for grounded, cited answers:
```python
import anthropic  # or: from openai import OpenAI

client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env variable


def rag_answer(question: str, top_k: int = 4) -> dict:
    """
    Full RAG pipeline: retrieve relevant chunks → build prompt → get LLM answer.
    Returns the answer and source citations.
    """
    # Step 1: Retrieve relevant context
    chunks = retrieve(question, top_k=top_k)
    if not chunks:
        return {"answer": "No relevant information found.", "sources": []}
    # Step 2: Build the context block
    context_parts = []
    for i, chunk in enumerate(chunks):
        context_parts.append(
            f"[Source {i+1}: {chunk['section']} — {chunk['source_url']}]\n"
            f"{chunk['content']}"
        )
    context = "\n\n---\n\n".join(context_parts)
    # Step 3: Construct the prompt
    system_prompt = """You are a helpful technical assistant. Answer the user's
question based ONLY on the provided context. If the context doesn't contain
enough information to answer fully, say so clearly. Always cite your sources
using the [Source N] notation from the context."""
    user_prompt = f"""Context:
{context}
Question: {question}
Answer (cite sources):"""
    # Step 4: Call the LLM
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return {
        "answer": response.content[0].text,
        "sources": [c["source_url"] for c in chunks],
        "chunks_used": len(chunks),
    }


# Test the full pipeline
result = rag_answer("How do I use asyncio.gather() with error handling?")
print(result["answer"])
print("\nSources:")
for src in result["sources"]:
    print(f" - {src}")
```
Part 6: Using ScrapeGraphAI for Natural Language Extraction
For structured extraction without writing CSS selectors, ScrapeGraphAI crossed 15,000 GitHub stars by letting developers describe extraction requirements in plain English — the LLM builds the scraper without XPath or CSS selectors.
```bash
pip install scrapegraphai
```
```python
import json

from scrapegraphai.graphs import SmartScraperGraph

# Define your scraping task in plain English
graph_config = {
    "llm": {
        "api_key": "YOUR_ANTHROPIC_API_KEY",  # Or an OpenAI key with an "openai/..." model
        # Recent ScrapeGraphAI releases expect a "provider/model" string (assumed here)
        "model": "anthropic/claude-sonnet-4-20250514",
    },
    "verbose": False,
    "headless": True,
}
# Describe what you want; no selectors needed
scraper = SmartScraperGraph(
    prompt="""Extract all research papers listed on this page.
For each paper return: title, authors (as a list),
publication date, abstract (first 2 sentences), and URL.""",
    source="https://arxiv.org/list/cs.AI/recent",
    config=graph_config,
)
result = scraper.run()
# Returns structured JSON matching your description
print(json.dumps(result, indent=2))
```
The key advantage: when the website redesigns and changes its CSS classes, your scraper is far more likely to keep working, because the LLM extracts by meaning rather than by fixed selectors.
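Because the extraction is done by an LLM, the returned JSON is not guaranteed to match the requested shape, so it can be worth validating before it enters a dataset. The sketch below checks a payload against a small schema. The `Paper` dataclass, the `parse_papers` helper, and the assumed payload shapes (a bare list, or a dict wrapping one) are illustrative assumptions, not part of ScrapeGraphAI.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    authors: list
    url: str


def parse_papers(raw: object) -> tuple:
    """Return (valid papers, error messages) from an LLM-extracted payload."""
    papers, errors = [], []
    # The output shape is not guaranteed: accept a bare list or a dict wrapping one
    if isinstance(raw, dict):
        lists = [value for value in raw.values() if isinstance(value, list)]
        raw = lists[0] if lists else []
    if not isinstance(raw, list):
        return [], ["payload is not a list of records"]
    for i, item in enumerate(raw):
        if not isinstance(item, dict):
            errors.append(f"record {i}: not an object")
            continue
        title, url, authors = item.get("title"), item.get("url"), item.get("authors")
        if isinstance(authors, str):  # Tolerate "A, B" instead of ["A", "B"]
            authors = [a.strip() for a in authors.split(",") if a.strip()]
        if not (isinstance(title, str) and title.strip()):
            errors.append(f"record {i}: missing title")
        elif not (isinstance(authors, list) and all(isinstance(a, str) for a in authors)):
            errors.append(f"record {i}: authors is not a list of strings")
        elif not (isinstance(url, str) and url.startswith(("http://", "https://"))):
            errors.append(f"record {i}: missing or relative url")
        else:
            papers.append(Paper(title.strip(), authors, url))
    return papers, errors


sample_payload = {"papers": [
    {"title": "A", "authors": ["X", "Y"], "url": "https://arxiv.org/abs/1"},
    {"title": "", "authors": ["Z"], "url": "https://arxiv.org/abs/2"},
    {"title": "C", "authors": "P, Q", "url": "/abs/3"},
]}
valid, problems = parse_papers(sample_payload)
print(len(valid), problems)
```

Rejected records can be logged and re-extracted rather than silently stored.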
Part 7: Keeping Your RAG Knowledge Base Fresh
A static scrape gets stale. Here's a lightweight refresh scheduler:
```python
import asyncio
from datetime import datetime, timezone

REFRESH_INTERVAL_HOURS = 24
SOURCES = [
    "https://docs.python.org/3/",
    "https://playwright.dev/python/docs/intro",
    "https://docs.scrapy.org/en/latest/",
]


async def refresh_knowledge_base():
    """Re-scrape all sources and replace their chunks in ChromaDB."""
    print(f"Starting knowledge base refresh at {datetime.now(timezone.utc).isoformat()}")
    # Count pages seen for the first time vs. pages whose chunks were replaced
    new_count = updated_count = 0
    for source_url in SOURCES:
        print(f"Re-crawling: {source_url}")
        pages = await crawl_documentation_site(source_url, max_pages=20)
        for page in pages:
            # Check whether chunks for this URL are already stored
            existing = collection.get(
                where={"source_url": page["url"]},
                limit=1,
            )
            if existing["ids"]:
                # Delete old chunks for this URL before re-adding
                collection.delete(where={"source_url": page["url"]})
                updated_count += 1
            else:
                new_count += 1
            chunks = chunk_markdown_page(page)
            embed_and_store(chunks)
    print(f"Refresh complete — {new_count} new pages, {updated_count} updated.")
    return {"new": new_count, "updated": updated_count}


# Schedule with asyncio
async def scheduled_refresh():
    while True:
        await refresh_knowledge_base()
        print(f"Next refresh in {REFRESH_INTERVAL_HOURS} hours.")
        await asyncio.sleep(REFRESH_INTERVAL_HOURS * 3600)


# asyncio.run(scheduled_refresh())  # Uncomment to run continuously
```
Choosing the Right Tool for Your AI Scraping Needs
| Use case | Best tool | Reason |
|---|---|---|
| Scraping docs for RAG, self-hosted | Crawl4AI | Free, open-source, Markdown output |
| Scraping JS-heavy sites at scale | Firecrawl API | Handles anti-bot, clean Markdown |
| Extracting structured data (no selectors) | ScrapeGraphAI | Natural language prompts, JSON output |
| Feeding LangChain / LlamaIndex | Any + LangChain loaders | Direct integration available |
| Training dataset construction | httpx + Crawl4AI | Volume + low cost |
Performance Numbers: What to Expect
For a typical documentation RAG pipeline:
| Metric | Value |
|---|---|
| Crawl speed (Crawl4AI, 10 concurrent) | ~8 pages/second |
| Markdown size vs raw HTML | ~15% of original (85% reduction) |
| Embedding time (MiniLM-L6-v2, CPU) | ~1,000 chunks/minute |
| ChromaDB query latency (100k docs) | ~15ms |
| RAG answer latency (retrieval + LLM) | ~1.2s average |
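These figures can be combined into a back-of-the-envelope ingestion estimate. The sketch below takes the crawl and embedding rates from the table as defaults; the `estimate_ingest_seconds` helper, the page count, and the ratio of 10 chunks per page are assumptions for illustration.

```python
def estimate_ingest_seconds(
    n_pages: int,
    chunks_per_page: float,
    pages_per_second: float = 8.0,      # Crawl speed from the table above
    chunks_per_minute: float = 1000.0,  # CPU embedding speed from the table above
) -> dict:
    """Estimate crawl and embedding time, assuming the two stages run one after the other."""
    crawl_s = n_pages / pages_per_second
    n_chunks = n_pages * chunks_per_page
    embed_s = n_chunks / chunks_per_minute * 60
    return {"crawl_s": crawl_s, "embed_s": embed_s, "total_s": crawl_s + embed_s}


estimate = estimate_ingest_seconds(n_pages=2000, chunks_per_page=10)
print(estimate)
```

Under these assumptions a 2,000-page site takes about 4 minutes to crawl and about 20 minutes to embed on CPU, so embedding, not crawling, is the stage worth optimising first.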
FAQ
Q: Is it legal to scrape websites for AI training data?
The legal landscape is complex, varies by jurisdiction, and is evolving rapidly. Scraping publicly accessible pages has often been tolerated, but several lawsuits (particularly against AI companies scraping copyrighted content) are ongoing, so public availability alone does not settle the question. Always check robots.txt and the site's Terms of Service, and consult legal counsel for commercial applications.
Q: What's the difference between Crawl4AI and Firecrawl?
Crawl4AI is open-source and self-hosted: you run it yourself, with no API costs. Firecrawl is a managed API: you pay per request, but it handles infrastructure, JavaScript rendering, and anti-bot bypass for you. For prototyping, Crawl4AI. For production at scale, Firecrawl.
Q: How many chunks should a RAG database have?
For a focused domain (e.g., one product's docs), 5,000–50,000 chunks work well with ChromaDB. For large-scale knowledge bases (hundreds of websites), use managed vector DBs like Pinecone or Weaviate, which are optimised for billions of vectors.
Q: What chunk size is best for RAG?
800–1,000 characters (roughly 200–250 tokens at about 4 characters per token) with 10–15% overlap is a good starting range for most embedding models. Smaller chunks (under 300 chars) tend to lose context; larger chunks (over 1,500 chars) tend to degrade retrieval precision. Treat these as defaults to tune against your own retrieval results.
Q: Can I use this approach with local LLMs?
Absolutely. Replace the Anthropic API call with Ollama (for local Llama 3 or Mistral). The retrieval pipeline is identical; only the final generation step changes.
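The robots.txt check recommended in the first answer can be automated with the standard library. The sketch below parses an inline example file so it runs without network access. The rules, the `my-rag-crawler` user agent, and the `build_robot_checker` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robot_checker(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a reusable checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules for illustration; real sites publish theirs at /robots.txt
example_robots = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""
USER_AGENT = "my-rag-crawler"  # Hypothetical crawler name

checker = build_robot_checker(example_robots)
allowed_docs = checker.can_fetch(USER_AGENT, "https://example.com/docs/intro")
allowed_private = checker.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = checker.crawl_delay(USER_AGENT)
print(allowed_docs, allowed_private, delay)
```

Against a live site, call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse()`, then skip any URL for which `can_fetch` returns `False` and wait `crawl_delay` seconds between requests when a delay is set.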
Summary
| Step | Tool | What it does |
|---|---|---|
| 1. Crawl | Crawl4AI / Firecrawl | Scrape pages → clean Markdown |
| 2. Chunk | LangChain TextSplitter | Split into retrieval-optimised segments |
| 3. Embed | sentence-transformers | Convert text to semantic vectors |
| 4. Store | ChromaDB | Index and persist embeddings |
| 5. Retrieve | ChromaDB query | Find most relevant chunks for a question |
| 6. Generate | Claude / GPT-4 | Answer grounded in retrieved context |
| 7. Refresh | asyncio scheduler | Keep knowledge base current |
Five years ago, web scraping was mostly associated with lead generation, price monitoring, and SEO tooling. In 2026, its role is much larger. Every RAG system, AI research assistant, internal knowledge bot, and real-time LLM application needs a way to collect fresh information. Models provide reasoning. Scrapers provide knowledge.
The developers who understand both will have a significant advantage over those who only understand one side of the stack.