How to Scrape Websites for AI Training Data & RAG Pipelines with Python

Learn how to scrape clean, structured web data for AI training and RAG pipelines using Python, Crawl4AI, and Firecrawl. Step-by-step 2026 guide with working code.

ZyVOP

Senior Developer

June 4, 2026

Why Every AI Developer Needs to Know How to Scrape the Web

Here's the uncomfortable truth about building AI applications in 2026: the quality of your output is only as good as the quality of your input data.

Pre-trained models like GPT-4 or Claude have a knowledge cutoff. They don't know about the paper published last Tuesday, the product launched last month, or the price change that happened this morning. If your AI application needs current, domain-specific, or proprietary knowledge, you have two main options:

  1. Fine-tune the model on your own data (expensive, slow, and it usually needs a large set of curated examples)

  2. Use Retrieval-Augmented Generation (RAG): give the model query-time access to scraped, structured knowledge

Option 2 is almost always the right answer. And web scraping is the engine that powers it.

According to Zyte's 2026 Web Scraping Industry Report, AI-powered code generation, LLM-based extraction, and intelligent browser automation are compressing development cycles dramatically — and a growing share of scraping pipelines now feed directly into LLM workflows.

The web scraping software market sits at $1.17 billion in 2026, growing at an 18.5% CAGR — and the AI-powered data extraction segment specifically is projected at $7.48 billion, trending toward $38.44 billion. Web scraping isn't adjacent to the AI wave. It's riding it.

This guide shows you exactly how to build a complete pipeline: scrape web content → clean it → embed it → store it in a vector database → query it with an LLM.


What Is a RAG Pipeline and Why Does It Need Web Data?

Large language models are limited by their training data. They don't know about the documentation update published yesterday, the product released this morning, or the article posted five minutes ago.

Retrieval-Augmented Generation (RAG) solves this by allowing an LLM to retrieve relevant information from an external knowledge base before generating a response. Instead of answering purely from memory, the model answers using information retrieved at query time.

User Question
     ↓
Similarity Search → Vector Database → Retrieve Relevant Chunks
                                              ↓
                              LLM (GPT-4 / Claude / Llama) + Context
                                              ↓
                                    Grounded, Accurate Answer
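
To make the flow above concrete, here is a minimal sketch of the retrieval step using only the standard library. It is a toy: word counts stand in for a real embedding model, and the names `vectorise`, `toy_retrieve`, and `knowledge_base` are made up for this illustration. The later parts of this guide use sentence-transformers and ChromaDB for the same job.

```python
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question and return the best top_k."""
    q = vectorise(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, vectorise(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical three-chunk knowledge base
knowledge_base = [
    "asyncio.gather runs awaitables concurrently and collects their results.",
    "ChromaDB stores embeddings and supports similarity queries.",
    "Cookie banners and navigation menus are noise for retrieval.",
]

top = toy_retrieve("How does asyncio.gather run awaitables?", knowledge_base, top_k=1)
print(top[0])
```

The chunk that comes back would then be pasted into the LLM prompt as context. A real embedding model replaces the word counts so that paraphrases match as well as exact words.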

The quality of a RAG system depends heavily on the quality of the data inside that knowledge base. Documentation sites, research papers, blogs, news articles, and internal company content all need to be collected and kept fresh, which is where web scraping comes in.

The challenge is that raw HTML contains a huge amount of noise: navigation menus, cookie banners, footers, advertisements, and tracking scripts. Modern AI-focused scrapers remove that noise and produce clean, structured content that can be embedded, stored in vector databases, and retrieved efficiently by LLMs.
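
As a rough illustration of what "removing the noise" means, the sketch below drops text inside a handful of boilerplate tags using Python's built-in `html.parser`. The tag list and the `MainTextExtractor` class are assumptions made for this example. Tools such as Crawl4AI do considerably more (content scoring, overlay removal, Markdown conversion), so treat this as a picture of the idea rather than a replacement.

```python
from html.parser import HTMLParser

# Assumed list of tags whose text is usually boilerplate; adjust per site
NOISE_TAGS = {"nav", "footer", "header", "aside", "script", "style", "noscript", "form"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside a noise tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # > 0 while inside any noise tag
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = """
<html><body>
  <nav><a href="/">Home</a><a href="/tags">Tags</a></nav>
  <main><h1>Event loops</h1><p>The event loop schedules coroutines.</p></main>
  <script>trackVisitor();</script>
  <footer>Cookie Policy</footer>
</body></html>
"""

clean = extract_main_text(sample)
print(clean)
```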


The AI Scraping Stack in 2026

Before we write code, here are the tools we'll use and why:

| Tool | Role | Why It Matters for AI |
| --- | --- | --- |
| Crawl4AI | Open-source crawler optimised for RAG | Returns clean Markdown, handles JS, free |
| Firecrawl | Managed API for LLM-ready content | Zero infra, Markdown output, JS rendering |
| ScrapeGraphAI | LLM-powered extraction via natural language | No CSS selectors, adapts to layout changes |
| ChromaDB | Local vector database | Stores and queries embeddings |
| sentence-transformers | Embedding model | Converts text chunks to vectors |
| LangChain | RAG orchestration | Connects scrapers, embeddings, and LLMs |


Part 1: Crawl4AI — The Open-Source RAG Scraper

Crawl4AI is an open-source Python crawler built specifically for RAG pipelines. It generates clean Markdown optimised for RAG with BM25-based content filtering, supports LLM-powered structured extraction, and handles full-site crawling with link following and depth control — with no per-request costs.

This makes it the go-to choice for teams who want full control without paying per-request fees.

Installation

pip install crawl4ai
crawl4ai-setup   # Downloads browser binaries — required first time

Basic usage: scrape a single page to clean Markdown

import asyncio
from crawl4ai import AsyncWebCrawler

async def scrape_to_markdown(url: str) -> str:
    """
    Scrape a URL and return clean, LLM-ready Markdown.
    Crawl4AI strips navigation, footers, ads, and boilerplate automatically.
    """
    async with AsyncWebCrawler(verbose=False) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=10,     # Skip tiny elements (nav links, buttons)
            remove_overlay_elements=True, # Remove popups, cookie banners
            process_iframes=False,
        )

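    if result.success:
        # Assumption: newer Crawl4AI releases may expose this as result.markdown
        # instead of markdown_v2; check the version you have installed.
        return result.markdown_v2.raw_markdown
    else:
        raise RuntimeError(f"Crawl failed: {result.error_message}")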


# Test it
markdown = asyncio.run(scrape_to_markdown("https://docs.python.org/3/library/asyncio.html"))
print(markdown[:500])
print(f"\nTotal length: {len(markdown)} characters")

The output is clean, stripped Markdown with no HTML tags, no navigation menus, no cookie consent text — exactly what an LLM needs.
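
Single-page fetches fail for transient reasons (timeouts, rate limits), so it can help to wrap the call in a retry loop. The following is a minimal sketch of exponential backoff with jitter; `with_retries` and `flaky_fetch` are hypothetical helpers written for this example, not part of Crawl4AI.

```python
import asyncio
import random


async def with_retries(make_call, attempts: int = 3, base_delay: float = 1.0):
    """
    Await make_call() up to `attempts` times. After each failure, sleep
    base_delay * 2**n plus a little jitter. Re-raises the last error.
    """
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception as exc:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.2f}s")
            await asyncio.sleep(delay)


# Demo with a stand-in coroutine that fails twice, then succeeds
calls = {"count": 0}


async def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary network error")
    return "# Page title\n\nBody text"


markdown_demo = asyncio.run(with_retries(flaky_fetch, attempts=3, base_delay=0.01))
print(markdown_demo.splitlines()[0])
```

With the function defined earlier, the call would look like `await with_retries(lambda: scrape_to_markdown(url))` from inside another coroutine.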

Crawling an entire documentation site

import asyncio
from crawl4ai import AsyncWebCrawler
from urllib.parse import urljoin, urlparse
import json

async def crawl_documentation_site(
    start_url: str,
    max_pages: int = 50,
    same_domain_only: bool = True
) -> list[dict]:
    """
    Crawl a documentation site and return a list of
    {'url': ..., 'title': ..., 'content': ...} dicts — ready for RAG ingestion.
    """
    base_domain = urlparse(start_url).netloc
    visited = set()
    queue = [start_url]
    pages = []

    async with AsyncWebCrawler(verbose=False) as crawler:
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in visited:
                continue
            visited.add(url)

            print(f"[{len(pages)+1}/{max_pages}] Crawling: {url}")
            result = await crawler.arun(
                url=url,
                word_count_threshold=15,
                remove_overlay_elements=True,
            )

            if not result.success:
                continue

            pages.append({
                "url": url,
                "title": result.metadata.get("title", ""),
                "content": result.markdown_v2.raw_markdown,
                "scraped_at": result.metadata.get("timestamp", ""),
            })

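            # Discover new links (internal only unless same_domain_only is False)
            links = result.links.get("internal", []) or []
            if not same_domain_only:
                links = links + (result.links.get("external", []) or [])
            for link in links:
                # Resolve relative hrefs and drop #fragments so a page is not queued twice
                absolute = urlparse(urljoin(url, link.get("href", "")))._replace(fragment="").geturl()
                if absolute in visited or absolute in queue:
                    continue
                if urlparse(absolute).scheme not in ("http", "https"):
                    continue
                if same_domain_only and urlparse(absolute).netloc != base_domain:
                    continue
                queue.append(absolute)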

    print(f"\nCrawled {len(pages)} pages.")
    return pages


pages = asyncio.run(crawl_documentation_site(
    "https://docs.python.org/3/",
    max_pages=30
))

# Save raw crawl output
with open("docs_crawl.json", "w") as f:
    json.dump(pages, f, indent=2, ensure_ascii=False)
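
Before pointing a crawler like this at a real site, it is worth checking the site's robots.txt. The sketch below uses the standard library's `urllib.robotparser` on an inline example file so it runs offline. The user agent string and the robots.txt content are made up for illustration. For a live site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-rag-crawler"  # hypothetical crawler name

# Hypothetical robots.txt content, inlined so the example needs no network
example_robots = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = load_robots(example_robots)

docs_allowed = robots.can_fetch(USER_AGENT, "https://example.com/docs/intro")
private_allowed = robots.can_fetch(USER_AGENT, "https://example.com/private/report")
delay_seconds = robots.crawl_delay(USER_AGENT)  # None when no Crawl-delay is set

print(docs_allowed, private_allowed, delay_seconds)
```

Inside the crawl loop, a URL would be skipped when `can_fetch` returns False, and `asyncio.sleep(delay_seconds or 0)` between requests would honour the requested delay.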

Part 2: Cleaning and Chunking for Optimal RAG Performance

Raw page content needs to be split into chunks before embedding. The chunk size is critical: too small and you lose context; too large and retrieval becomes imprecise.

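# Requires the langchain-text-splitters package
# (older LangChain releases exposed these classes under langchain.text_splitter)
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter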

def chunk_markdown_page(page: dict) -> list[dict]:
    """
    Split a Markdown page into semantically meaningful chunks.
    Preserve heading context in each chunk for better retrieval.
    """

    # First: split by Markdown headers to preserve semantic sections
    headers_to_split = [
        ("#",  "h1"),
        ("##", "h2"),
        ("###","h3"),
    ]
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split,
        strip_headers=False  # Keep headers in chunk text for context
    )

    # Then: split long sections by character count
    char_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,          # characters, roughly 200 tokens at ~4 characters per token
        chunk_overlap=80,        # 10% overlap reduces context loss at chunk boundaries
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    chunks = []
    header_docs = header_splitter.split_text(page["content"])

    for doc in header_docs:
        # If section is short enough, keep as one chunk
        if len(doc.page_content) <= 900:
            chunks.append({
                "content": doc.page_content,
                "source_url": page["url"],
                "page_title": page["title"],
                "section": doc.metadata.get("h2") or doc.metadata.get("h1", ""),
                "char_count": len(doc.page_content),
            })
        else:
            # Split large sections further
            sub_chunks = char_splitter.split_text(doc.page_content)
            for i, chunk_text in enumerate(sub_chunks):
                chunks.append({
                    "content": chunk_text,
                    "source_url": page["url"],
                    "page_title": page["title"],
                    "section": doc.metadata.get("h2") or doc.metadata.get("h1", ""),
                    "chunk_index": i,
                    "char_count": len(chunk_text),
                })

    return chunks


# Process all pages
all_chunks = []
for page in pages:
    chunks = chunk_markdown_page(page)
    all_chunks.extend(chunks)

print(f"Total chunks: {len(all_chunks)}")
if all_chunks:  # avoid dividing by zero when nothing was crawled
    print(f"Average chunk size: {sum(c['char_count'] for c in all_chunks) / len(all_chunks):.0f} chars")
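
To see what chunk size and overlap do in isolation, here is a minimal character-based sliding window. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers paragraph and sentence boundaries); it is a simplified sketch, and the 4-characters-per-token figure is a rough rule of thumb for English text, not an exact tokenizer count.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, overlap: int = 80) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the last
    `overlap` characters of the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return round(len(text) / chars_per_token)


# 2,000 characters of dummy text
sample_text = "".join(str(i % 10) for i in range(2000))
demo_chunks = sliding_window_chunks(sample_text, chunk_size=800, overlap=80)

for i, chunk in enumerate(demo_chunks):
    print(i, len(chunk), estimate_tokens(chunk))
```

Shrinking `chunk_size` produces more, narrower chunks that match queries precisely but carry less surrounding context; growing it does the opposite.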

Part 3: Embedding and Storing in ChromaDB

from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid

# Load embedding model (runs locally, no API key needed)
# "all-MiniLM-L6-v2" is fast and good; use "all-mpnet-base-v2" for higher quality
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Initialise ChromaDB (local persistent storage)
chroma_client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False)
)

collection = chroma_client.get_or_create_collection(
    name="python_docs",
    metadata={"hnsw:space": "cosine"}   # Cosine similarity for text
)

def embed_and_store(chunks: list[dict], batch_size: int = 64):
    """Embed chunks and store in ChromaDB with metadata."""

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["content"] for c in batch]

        print(f"Embedding batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1}...")
        embeddings = embedder.encode(texts, show_progress_bar=False).tolist()

        collection.add(
            ids=[str(uuid.uuid4()) for _ in batch],
            embeddings=embeddings,
            documents=texts,
            metadatas=[{
                "source_url":  c["source_url"],
                "page_title":  c["page_title"],
                "section":     c.get("section", ""),
            } for c in batch]
        )

    print(f"Stored {len(chunks)} chunks in ChromaDB.")


embed_and_store(all_chunks)
print(f"Collection size: {collection.count()} documents")
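
One consequence of using random UUIDs as IDs is that running the ingestion twice stores every chunk twice. A possible alternative, sketched below, is to derive the ID from the chunk's URL and content so the same chunk always gets the same ID. The helpers `chunk_id` and `dedupe_chunks` are hypothetical names for this example. ChromaDB collections also offer an `upsert` method, which pairs naturally with stable IDs.

```python
import hashlib


def chunk_id(source_url: str, content: str) -> str:
    """Stable ID: the same URL and content always produce the same string."""
    digest = hashlib.sha256()
    digest.update(source_url.encode("utf-8"))
    digest.update(b"\x00")  # separator so URL and content cannot run together
    digest.update(content.encode("utf-8"))
    return digest.hexdigest()[:32]


def dedupe_chunks(chunks: list[dict]) -> tuple[list[str], list[dict]]:
    """Drop exact duplicates and return (ids, chunks) in first-seen order."""
    seen: dict[str, dict] = {}
    for chunk in chunks:
        seen.setdefault(chunk_id(chunk["source_url"], chunk["content"]), chunk)
    return list(seen.keys()), list(seen.values())


# Hypothetical chunks; the third is an exact duplicate of the first
example_chunks = [
    {"source_url": "https://example.com/a", "content": "Intro to event loops."},
    {"source_url": "https://example.com/a", "content": "Running coroutines."},
    {"source_url": "https://example.com/a", "content": "Intro to event loops."},
]

ids, unique_chunks = dedupe_chunks(example_chunks)
print(len(example_chunks), "chunks in,", len(unique_chunks), "unique")
```

In `embed_and_store`, these IDs could replace the `uuid.uuid4()` values, with `collection.upsert(...)` in place of `collection.add(...)`.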

Part 4: Querying — The RAG Retrieval Loop

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """
    Embed a query and retrieve the most relevant chunks from ChromaDB.
    """
    query_embedding = embedder.encode([query]).tolist()

    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        retrieved.append({
            "content":    doc,
            "source_url": meta["source_url"],
            "section":    meta["section"],
            "similarity": round(1 - dist, 3),   # Convert cosine distance to similarity
        })

    return retrieved


# Test retrieval
results = retrieve("How does asyncio event loop work?", top_k=3)
for r in results:
    print(f"\n--- {r['section']} (similarity: {r['similarity']}) ---")
    print(r["content"][:200])
    print(f"Source: {r['source_url']}")
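
A vector search always returns its top results, even when none of them is actually relevant. One common safeguard is to drop results below a similarity cut-off before they reach the LLM. The sketch below works on the list-of-dicts shape that `retrieve` returns; the 0.3 threshold and the sample scores are arbitrary assumptions that you would tune on your own data.

```python
def filter_by_similarity(
    retrieved: list[dict],
    min_similarity: float = 0.3,   # assumed cut-off; tune per corpus and model
    max_chunks: int = 5,
) -> list[dict]:
    """Keep only sufficiently similar results, best first."""
    kept = [r for r in retrieved if r["similarity"] >= min_similarity]
    kept.sort(key=lambda r: r["similarity"], reverse=True)
    return kept[:max_chunks]


# Hand-written sample in the same shape as retrieve()'s return value
sample_results = [
    {"content": "The event loop runs tasks.", "source_url": "https://example.com/a",
     "section": "Event loop", "similarity": 0.62},
    {"content": "Release notes for 3.2.", "source_url": "https://example.com/b",
     "section": "Changelog", "similarity": 0.18},
    {"content": "Tasks wrap coroutines.", "source_url": "https://example.com/c",
     "section": "Tasks", "similarity": 0.41},
]

relevant = filter_by_similarity(sample_results, min_similarity=0.3)
print([r["section"] for r in relevant])
```

If the filtered list comes back empty, the application can say it found nothing relevant instead of passing weak context to the model.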

Part 5: The Full RAG Answer Pipeline

Now we connect retrieval to an LLM for grounded, cited answers:

import anthropic   # or: from openai import OpenAI

client = anthropic.Anthropic()   # Uses ANTHROPIC_API_KEY env variable

def rag_answer(question: str, top_k: int = 4) -> dict:
    """
    Full RAG pipeline: retrieve relevant chunks → build prompt → get LLM answer.
    Returns the answer and source citations.
    """
    # Step 1: Retrieve relevant context
    chunks = retrieve(question, top_k=top_k)

    if not chunks:
        return {"answer": "No relevant information found.", "sources": []}

    # Step 2: Build the context block
    context_parts = []
    for i, chunk in enumerate(chunks):
        context_parts.append(
            f"[Source {i+1}: {chunk['section']} — {chunk['source_url']}]\n"
            f"{chunk['content']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # Step 3: Construct the prompt
    system_prompt = """You are a helpful technical assistant. Answer the user's
question based ONLY on the provided context. If the context doesn't contain
enough information to answer fully, say so clearly. Always cite your sources
using the [Source N] notation from the context."""

    user_prompt = f"""Context:
{context}

Question: {question}

Answer (cite sources):"""

    # Step 4: Call the LLM
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}]
    )

    return {
        "answer":  response.content[0].text,
        "sources": [c["source_url"] for c in chunks],
        "chunks_used": len(chunks),
    }


# Test the full pipeline
result = rag_answer("How do I use asyncio.gather() with error handling?")
print(result["answer"])
print("\nSources:")
for src in result["sources"]:
    print(f"  - {src}")
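
The prompt built above grows with `top_k` and with chunk length, so it can help to cap how much context is sent. The following is a minimal sketch of a character budget applied to the retrieved chunks before the prompt is assembled. `fit_context` and the 6,000-character limit are assumptions for illustration; a production system might count tokens with the model's own tokenizer instead.

```python
def fit_context(chunks: list[dict], max_chars: int = 6000) -> list[dict]:
    """
    Walk the chunks in rank order and keep each one that still fits in the
    remaining character budget. A chunk that is too large is skipped, so a
    smaller, lower-ranked chunk may still be included.
    """
    selected, used = [], 0
    for chunk in chunks:
        size = len(chunk["content"])
        if used + size > max_chars:
            continue
        selected.append(chunk)
        used += size
    return selected


# Hypothetical ranked chunks of 3,000, 4,000 and 2,000 characters
ranked_chunks = [
    {"content": "a" * 3000, "source_url": "https://example.com/1"},
    {"content": "b" * 4000, "source_url": "https://example.com/2"},
    {"content": "c" * 2000, "source_url": "https://example.com/3"},
]

within_budget = fit_context(ranked_chunks, max_chars=6000)
print([c["source_url"] for c in within_budget])
```

In `rag_answer`, this would sit between Step 1 and Step 2, for example `chunks = fit_context(chunks)`.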

Part 6: Using ScrapeGraphAI for Natural Language Extraction

ScrapeGraphAI handles structured extraction without CSS selectors or XPath: you describe the fields you want in plain English, and an LLM works out how to pull them from the page. The project has passed 15,000 GitHub stars.

pip install scrapegraphai

from scrapegraphai.graphs import SmartScraperGraph

# Define your scraping task in plain English
graph_config = {
    "llm": {
        "api_key": "YOUR_ANTHROPIC_OR_OPENAI_KEY",
        # Assumption: recent ScrapeGraphAI releases expect a "provider/model" string here
        "model": "anthropic/claude-sonnet-4-20250514",
    },
    "verbose": False,
    "headless": True,
}

# Describe what you want — no selectors needed
scraper = SmartScraperGraph(
    prompt="""Extract all research papers listed on this page.
              For each paper return: title, authors (as a list),
              publication date, abstract (first 2 sentences), and URL.""",
    source="https://arxiv.org/list/cs.AI/recent",
    config=graph_config,
)

result = scraper.run()

# Returns structured JSON matching your description
import json
print(json.dumps(result, indent=2))

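The key advantage: when the website redesigns and changes its CSS classes, your scraper is far more likely to keep working, because it targets the content rather than the markup structure. The trade-off is that LLM output can vary between runs, so validate the result before storing it.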
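
Because the extraction is done by an LLM, the returned JSON is worth checking before it goes into a dataset. The sketch below validates a result against the fields requested in the prompt. The exact shape ScrapeGraphAI returns depends on the prompt and model, so the `{"papers": [...]}` wrapper, the `Paper` dataclass, and `parse_papers` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    authors: list[str]
    url: str


def parse_papers(raw: object) -> tuple[list[Paper], list[str]]:
    """Return (valid papers, human-readable problems) from an extraction result."""
    # Assumed shape: either a bare list or a dict with a "papers" list
    items = raw.get("papers", []) if isinstance(raw, dict) else raw
    if not isinstance(items, list):
        return [], ["expected a list of papers"]

    papers, problems = [], []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        title, authors, url = item.get("title"), item.get("authors"), item.get("url")
        if not isinstance(title, str) or not title.strip():
            problems.append(f"item {i}: missing title")
            continue
        if isinstance(authors, str):  # tolerate "A, B" instead of ["A", "B"]
            authors = [a.strip() for a in authors.split(",") if a.strip()]
        if not isinstance(authors, list) or not all(isinstance(a, str) for a in authors):
            problems.append(f"item {i}: authors is not a list of strings")
            continue
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            problems.append(f"item {i}: invalid url")
            continue
        papers.append(Paper(title.strip(), authors, url))
    return papers, problems


# Hand-written sample result, not real scraper output
raw_result = {"papers": [
    {"title": "Paper A", "authors": ["Ada", "Bob"], "url": "https://example.org/a"},
    {"title": "Paper B", "authors": "Cy, Di", "url": "https://example.org/b"},
    {"title": "", "authors": [], "url": "https://example.org/c"},
    {"title": "Paper D", "authors": ["Eve"], "url": "not-a-url"},
]}

papers, problems = parse_papers(raw_result)
print(len(papers), "valid;", problems)
```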


Part 7: Keeping Your RAG Knowledge Base Fresh

A static scrape gets stale. Here's a lightweight refresh scheduler:

# Reuses collection, crawl_documentation_site, chunk_markdown_page and
# embed_and_store from the earlier parts of this guide
import asyncio
from datetime import datetime, timezone

REFRESH_INTERVAL_HOURS = 24
SOURCES = [
    "https://docs.python.org/3/",
    "https://playwright.dev/python/docs/intro",
    "https://docs.scrapy.org/en/latest/",
]

async def refresh_knowledge_base():
    """Re-scrape all sources and update ChromaDB with new content."""
    print(f"Starting knowledge base refresh at {datetime.now(timezone.utc).isoformat()}")

    # Count pages seen for the first time vs. pages whose chunks were replaced
    new_count = updated_count = 0

    for source_url in SOURCES:
        print(f"Re-crawling: {source_url}")
        pages = await crawl_documentation_site(source_url, max_pages=20)

        for page in pages:
            # Check whether chunks for this URL already exist in the collection
            existing = collection.get(
                where={"source_url": page["url"]},
                limit=1
            )
            if existing["ids"]:
                # Delete old chunks for this URL before re-adding
                collection.delete(where={"source_url": page["url"]})
                updated_count += 1
            else:
                new_count += 1

            chunks = chunk_markdown_page(page)
            embed_and_store(chunks)

    print(f"Refresh complete — {new_count} new pages, {updated_count} updated.")
    return {"new": new_count, "updated": updated_count}


# Schedule with asyncio
async def scheduled_refresh():
    while True:
        await refresh_knowledge_base()
        print(f"Next refresh in {REFRESH_INTERVAL_HOURS} hours.")
        await asyncio.sleep(REFRESH_INTERVAL_HOURS * 3600)

# asyncio.run(scheduled_refresh())   # Uncomment to run continuously
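
The refresh above re-embeds every page on every run, even when nothing changed. One way to cut that cost is to keep a content hash per URL and only re-process pages whose hash differs. The following is a minimal sketch: `content_hash`, `split_changed`, and the in-memory `known_hashes` dict are assumptions for illustration. In practice the hashes would be persisted between runs, for example in a JSON file or in the chunk metadata.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the text after collapsing whitespace, so trivial spacing
    differences do not count as a change."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def split_changed(pages: list[dict], known_hashes: dict[str, str]) -> tuple[list[dict], dict[str, str]]:
    """Return (pages that are new or changed, updated url -> hash mapping)."""
    changed = []
    updated = dict(known_hashes)
    for page in pages:
        digest = content_hash(page["content"])
        if known_hashes.get(page["url"]) != digest:
            changed.append(page)
            updated[page["url"]] = digest
    return changed, updated


# Two hypothetical crawl runs over the same URLs
first_run = [
    {"url": "https://example.com/a", "content": "Hello  world"},
    {"url": "https://example.com/b", "content": "Docs page"},
]
second_run = [
    {"url": "https://example.com/a", "content": "Hello world\n"},        # whitespace only
    {"url": "https://example.com/b", "content": "Docs page, revised"},   # real change
]

changed_first, hashes = split_changed(first_run, {})
changed_second, hashes = split_changed(second_run, hashes)
print([p["url"] for p in changed_second])
```

In `refresh_knowledge_base`, only the pages returned as changed would go through the delete, chunk, and embed steps.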

Choosing the Right Tool for Your AI Scraping Needs

| Use case | Best tool | Reason |
| --- | --- | --- |
| Scraping docs for RAG, self-hosted | Crawl4AI | Free, open-source, Markdown output |
| Scraping JS-heavy sites at scale | Firecrawl API | Handles anti-bot, clean Markdown |
| Extracting structured data (no selectors) | ScrapeGraphAI | Natural language prompts, JSON output |
| Feeding LangChain / LlamaIndex | Any + LangChain loaders | Direct integration available |
| Training dataset construction | httpx + Crawl4AI | Volume + low cost |


Performance Numbers: What to Expect

For a typical documentation RAG pipeline:

| Metric | Value |
| --- | --- |
| Crawl speed (Crawl4AI, 10 concurrent) | ~8 pages/second |
| Markdown size vs raw HTML | ~15% of original (85% reduction) |
| Embedding time (MiniLM-L6-v2, CPU) | ~1,000 chunks/minute |
| ChromaDB query latency (100k docs) | ~15ms |
| RAG answer latency (retrieval + LLM) | ~1.2s average |


FAQ

Q: Is it legal to scrape websites for AI training data?
The legal landscape is complex and evolving rapidly. Scraping publicly accessible pages has often been tolerated, but that is not a settled legal rule, and several lawsuits (particularly against AI companies that scraped copyrighted content) are ongoing. Always check robots.txt and the site's Terms of Service, and consult legal counsel for commercial applications.

Q: What's the difference between Crawl4AI and Firecrawl?
Crawl4AI is open-source and self-hosted: you run it yourself and pay no API costs. Firecrawl is a managed API: you pay per request, and it handles infrastructure, JavaScript rendering, and anti-bot bypass for you. Use Crawl4AI for prototyping and self-hosted pipelines; consider Firecrawl for production at scale when you would rather not run the crawling infrastructure yourself.

Q: How many chunks should a RAG database have?
For a focused domain (e.g., one product's docs), 5,000–50,000 chunks works well with ChromaDB. For large-scale knowledge bases (hundreds of websites), use a managed vector database such as Pinecone or Weaviate, which are optimised for billions of vectors.

Q: What chunk size is best for RAG?
800–1,000 characters (roughly 200–250 tokens at about 4 characters per token) with 10–15% overlap is a reasonable starting range for most embedding models; tune it against your own retrieval results. Smaller chunks (under 300 characters) lose context; larger chunks (over 1,500 characters) tend to degrade retrieval precision.

Q: Can I use this approach with local LLMs?
Yes. Replace the Anthropic API call with a local model served through Ollama (for example Llama 3 or Mistral). The retrieval pipeline is identical; only the final generation step changes.


Summary

| Step | Tool | What it does |
| --- | --- | --- |
| 1. Crawl | Crawl4AI / Firecrawl | Scrape pages → clean Markdown |
| 2. Chunk | LangChain TextSplitter | Split into retrieval-optimised segments |
| 3. Embed | sentence-transformers | Convert text to semantic vectors |
| 4. Store | ChromaDB | Index and persist embeddings |
| 5. Retrieve | ChromaDB query | Find most relevant chunks for a question |
| 6. Generate | Claude / GPT-4 | Answer grounded in retrieved context |
| 7. Refresh | asyncio scheduler | Keep knowledge base current |

Five years ago, web scraping was mostly associated with lead generation, price monitoring, and SEO tooling. In 2026, its role is much larger. Every RAG system, AI research assistant, internal knowledge bot, and real-time LLM application needs a way to collect fresh information. Models provide reasoning. Scrapers provide knowledge.

The developers who understand both will have a significant advantage over those who only understand one side of the stack.
