Web Scraping with Python: A Complete BeautifulSoup & Requests Guide

Learn how to build reliable web scrapers with Python, extract structured data from HTML, handle pagination and errors, and export results to CSV.

Tags: Python, Web Scraping, BeautifulSoup, Requests, HTML Parsing, Data Extraction, Web Crawling, Pandas, automation, Data Collection
ZyVOP

Senior Developer

June 2, 2026

Every day, billions of web pages sit on the internet — full of prices, headlines, job listings, research data, and more. Most of it has no official API. Web scraping is how you collect that data programmatically, turning raw HTML into clean, structured datasets you can actually use.

Python is one of the most popular languages for web scraping. It has a rich ecosystem, readable syntax, and two libraries in particular that make scraping straightforward: requests (for fetching web pages) and BeautifulSoup (for parsing them).

By the end of this guide, you will:

  • Understand how HTTP requests and HTML parsing work together

  • Write a scraper that collects data from real websites

  • Handle pagination, headers, and common errors

  • Export your data to CSV using pandas

Let's dig in.


How Web Scraping Works

When you type a URL into a browser, your browser sends an HTTP GET request to a server. The server responds with HTML. Your browser renders that HTML into the visual page you see.

Web scraping does the same thing — but instead of a browser rendering the HTML visually, Python reads it programmatically and extracts exactly the data you want.

Your Script  →  HTTP GET Request  →  Web Server
Web Server   →  HTML Response     →  Your Script
Your Script  →  Parse HTML        →  Structured Data

There are two key parts:

  • requests handles the first half: sending the HTTP request and receiving the HTML

  • BeautifulSoup handles the second half: parsing that HTML so you can navigate and extract from it
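Keeping the two halves in separate functions makes the parsing half easy to try without touching the network. The following is a minimal sketch, not a fixed recipe: fetch_html and extract_headings are hypothetical helper names, the sample markup is made up, and the code assumes requests and beautifulsoup4 are installed (see Installation). It uses Python's built-in html.parser so it does not depend on lxml.

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url):
    """First half: send the GET request and return the HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def extract_headings(html):
    """Second half: parse an HTML string and return the text of every <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# The parsing half works on any HTML string, so it can be tried offline.
sample_html = "<html><body><h2>First</h2><p>text</p><h2> Second </h2></body></html>"
print(extract_headings(sample_html))  # ['First', 'Second']
```

In a real run the two would be chained as extract_headings(fetch_html(url)).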


Installation

Install all required libraries with a single pip command:

pip install requests beautifulsoup4 pandas lxml

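Why lxml? BeautifulSoup supports multiple parsers: Python's built-in html.parser, lxml, and html5lib. lxml is the fastest of the three and tolerant of malformed HTML, which matters because real-world HTML is often messy. (html5lib is the most lenient, since it parses pages the way a browser does, but it is much slower.)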
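If a script may run on a machine where lxml is not installed, one option is to fall back to the built-in parser. This is a sketch under that assumption; make_soup is a hypothetical helper name, and the two parsers can build slightly different trees for badly broken markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when available, otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")

# Deliberately broken markup: neither <p> nor <b> is ever closed.
soup = make_soup("<p>Unclosed <b>bold")
print(soup.p.get_text())  # Unclosed bold
print(soup.b.get_text())  # bold
```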


Your First Scraper: Fetching a Page

Let's start simple. Here is how to fetch the HTML of any webpage:

import requests

url = "https://books.toscrape.com/"

# A User-Agent tells the server what kind of client is making the request.
# Without this, many servers will block or return a different response.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers, timeout=10)

# Always check the status code before parsing
print(response.status_code)   # 200 = success
print(len(response.text))     # Length of the HTML string

About status codes:

Code      Meaning
200       Success
301/302   Redirect (requests follows these automatically)
403       Forbidden (you're being blocked)
404       Page not found
429       Too many requests (you're being rate-limited)
500       Server error

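If you get a 403, the server is refusing the request, often because the User-Agent is missing or being rejected. If you get a 429, you are sending requests too fast: slow down, and honor the Retry-After header if the server sends one.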
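One way to turn the status codes above into scraper behavior is a small decision function. The sketch below is illustrative only: next_action and its return labels are hypothetical names, and the right policy depends on the site. Redirects are left out because requests follows them before your code sees the response.

```python
def next_action(status_code):
    """Map an HTTP status code to what a scraper might do next."""
    if 200 <= status_code < 300:
        return "parse"
    if status_code == 429 or 500 <= status_code < 600:
        return "retry_later"  # rate-limited or server trouble: back off, then retry
    if status_code == 404:
        return "skip"         # the page does not exist; retrying will not help
    if status_code in (401, 403):
        return "stop"         # blocked or unauthorized: review headers and the site's rules
    return "inspect"          # anything else deserves a manual look

for code in (200, 403, 404, 429, 500):
    print(code, next_action(code))
```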


Parsing HTML with BeautifulSoup

Once you have the HTML string, you pass it to BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")

# The `soup` object now represents the entire HTML document.
# You can navigate it like a tree.
print(soup.title.text)         # Page title
print(soup.find("h1").text)    # First h1 on the page

BeautifulSoup gives you several ways to find elements:

Method 1: find() — returns the first match

# Find the first element with tag <h2>
heading = soup.find("h2")

# Find the first element with a specific class
box = soup.find("div", class_="product-box")

# Find by ID
sidebar = soup.find("div", id="sidebar")

Method 2: find_all() — returns a list of all matches

# Find ALL <a> tags
all_links = soup.find_all("a")

# Iterate and extract
for link in all_links:
    print(link.text, link.get("href"))

Method 3: CSS Selectors with select() — the most powerful

If you know CSS, you already know this. .select() accepts any CSS selector string.

# All elements with class "product_pod"
products = soup.select("article.product_pod")

# All anchors inside elements with class "titleline" (select() returns a list)
title_links = soup.select(".titleline a")

# Nested selectors — p tags inside div.content
paragraphs = soup.select("div.content p")

# select_one() is like find() but uses CSS syntax
price = soup.select_one(".price_color")

Tip: Use your browser's DevTools to get selectors instantly. Right-click any element → Inspect → Right-click the highlighted HTML → Copy → Copy selector.
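The three lookup styles often reach the same elements, which is easy to confirm on a small inline document. The markup below is invented for illustration and only loosely imitates the structure of a product listing; html.parser is used so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<div id="sidebar"><a href="/about">About</a></div>
<article class="product_pod"><h3><a href="/b1" title="Book One">One</a></h3>
  <p class="price_color">£10.00</p></article>
<article class="product_pod"><h3><a href="/b2" title="Book Two">Two</a></h3>
  <p class="price_color">£12.50</p></article>
"""
soup = BeautifulSoup(html, "html.parser")

# find() and select_one() both return the first match.
price_find = soup.find("p", class_="price_color")
price_css = soup.select_one("p.price_color")
print(price_find.text, price_css.text)  # £10.00 £10.00

# find_all() needs a filter in Python; select() expresses it in the selector.
titles_find = [a["title"] for a in soup.find_all("a") if a.has_attr("title")]
titles_css = [a["title"] for a in soup.select("article.product_pod h3 a")]
print(titles_find)  # ['Book One', 'Book Two']
print(titles_css)   # ['Book One', 'Book Two']
```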


Real Example: Scraping Book Data

books.toscrape.com is a sandbox website built specifically for scraping practice. Let's scrape its catalog.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"
}

def parse_rating(class_string):
    """Convert word-based star rating to number."""
    rating_map = {
        "One": 1, "Two": 2, "Three": 3,
        "Four": 4, "Five": 5
    }
    # class_string looks like "star-rating Three"
    word = class_string.split()[-1]
    return rating_map.get(word, 0)

def scrape_page(url):
    """Scrape all books from a single catalogue page."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raises exception on 4xx/5xx

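    # Pass raw bytes so BeautifulSoup can detect the encoding the page declares;
    # response.text can mis-decode "£" when the server omits a charset header.
    soup = BeautifulSoup(response.content, "lxml")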
    books = []

    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price = article.select_one(".price_color").text.strip()
        rating_class = article.select_one(".star-rating")["class"]
        rating = parse_rating(" ".join(rating_class))
        in_stock = "In stock" in article.select_one(".availability").text

        books.append({
            "title": title,
            "price": price,
            "rating": rating,
            "in_stock": in_stock
        })

    return books

def scrape_catalog(pages=5):
    """Scrape multiple pages with polite delays."""
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    all_books = []

    for page_num in range(1, pages + 1):
        url = base_url.format(page_num)
        print(f"Scraping page {page_num}...")

        page_books = scrape_page(url)
        all_books.extend(page_books)

        time.sleep(1.5)  # Be polite — don't hammer the server

    return pd.DataFrame(all_books)

# Run the scraper
df = scrape_catalog(pages=10)
print(f"Scraped {len(df)} books")
print(df.head())

# Save to CSV
df.to_csv("books.csv", index=False)

Sample output:

Scraped 200 books
                                           title  price  rating  in_stock
0                          A Light in the Attic  £51.77       3      True
1                            Tipping the Velvet  £53.74       1      True
2                                    Soumission  £50.10       1      True
...
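The price column comes back as strings with a currency symbol, so it cannot be sorted or averaged until it is converted to a number. A minimal sketch of that conversion follows; parse_price is a hypothetical helper, and treating commas as thousands separators is an assumption that does not hold for every locale.

```python
import re

def parse_price(text):
    """Pull the numeric amount out of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())

print(parse_price("£51.77"))   # 51.77
print(parse_price("Â£53.74"))  # 53.74 (stray characters before the digits are ignored)
```

On the DataFrame from the scraper, this could be applied with df["price"].map(parse_price).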

Handling Pagination Automatically

The previous example used hard-coded page numbers. A better approach is to follow "Next" links dynamically — this way your scraper adapts to any number of pages.

from urllib.parse import urljoin

def scrape_all_pages(start_url):
    """Follow pagination links until there are no more pages."""
    all_books = []
    current_url = start_url

    while current_url:
        print(f"Scraping: {current_url}")
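        response = requests.get(current_url, headers=HEADERS, timeout=10)
        response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
        # Pass bytes so BeautifulSoup can detect the encoding the page declares
        soup = BeautifulSoup(response.content, "lxml")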

        # Scrape current page
        for article in soup.select("article.product_pod"):
            title = article.select_one("h3 a")["title"]
            price = article.select_one(".price_color").text.strip()
            all_books.append({"title": title, "price": price})

        # Find the "next" button — returns None if we're on the last page
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # Build the absolute URL from the relative href
            current_url = urljoin(current_url, next_btn["href"])
        else:
            current_url = None  # No more pages, stop the loop

        time.sleep(1)

    return pd.DataFrame(all_books)

df = scrape_all_pages("https://books.toscrape.com/catalogue/page-1.html")
print(f"Total books scraped: {len(df)}")

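This pattern works for most sites that render a "next" link in the server's HTML, such as product listings, news archives, and search results; usually only the selector changes. It does not cover infinite scroll or "load more" buttons driven by JavaScript.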
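Two details of the loop above are worth seeing in isolation: how urljoin resolves a relative href against the current page, and how to avoid looping forever if a site's "next" link ever points back to a page already visited. The guard is an addition for illustration; resolve_next is a hypothetical helper name.

```python
from urllib.parse import urljoin

def resolve_next(current_url, href, seen):
    """Return the absolute next-page URL, or None if it was already visited."""
    next_url = urljoin(current_url, href)
    if next_url in seen:
        return None  # the site looped back; stop instead of crawling forever
    seen.add(next_url)
    return next_url

seen = {"https://books.toscrape.com/catalogue/page-1.html"}

print(resolve_next("https://books.toscrape.com/catalogue/page-1.html", "page-2.html", seen))
# https://books.toscrape.com/catalogue/page-2.html

print(resolve_next("https://books.toscrape.com/catalogue/page-2.html", "page-1.html", seen))
# None

# The same relative link resolves differently depending on the base URL.
print(urljoin("https://books.toscrape.com/index.html", "catalogue/page-2.html"))
# https://books.toscrape.com/catalogue/page-2.html
```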


Extracting Common Data Types

Extracting text

# .text gives raw text including whitespace
raw = element.text

# .get_text(strip=True) is cleaner
clean = element.get_text(strip=True)

# .get_text(separator=", ") joins multiple text nodes
joined = element.get_text(separator=", ")

Extracting attributes

# Get the href from a link
url = soup.find("a")["href"]
url = soup.find("a").get("href")  # safer — returns None instead of KeyError

# Get the src from an image
img_src = soup.find("img").get("src")

# Get a data attribute
product_id = element.get("data-product-id")
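One attribute behaves differently from the rest: class can hold several values, so BeautifulSoup returns it as a list rather than a string. That is why the book scraper joins or indexes the class value of the star rating. A small illustration on invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="star-rating Three" data-product-id="42">x</p>', "html.parser"
)
tag = soup.p

print(tag["class"])                # ['star-rating', 'Three']
print(tag["class"][-1])            # Three
print(tag.get("data-product-id"))  # 42 (always a string, convert it if you need a number)
print(tag.get("href"))             # None, because the attribute is absent
```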

Extracting tables

HTML tables are tedious to parse manually. pandas does it in one line:

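from io import StringIO

import pandas as pd

# pd.read_html() returns a list of all tables on the page as DataFrames.
# Recent pandas versions expect a file-like object rather than a raw HTML string,
# so wrap the text in StringIO. It raises ValueError if the page has no <table>.
tables = pd.read_html(StringIO(response.text))
df = tables[0]  # first table on the page
print(df)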

Handling Errors Gracefully

Real-world scraping always involves errors — network timeouts, missing elements, rate limiting. Here is a robust error-handling pattern:

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    """Create a session with automatic retries on network errors."""
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})

    # Retry up to 3 times on connection errors and 500/502/503/504
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,           # Exponential backoff: waits grow with each retry (exact timing varies by urllib3 version)
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def safe_get_text(element, selector, default="N/A"):
    """Extract text from a CSS selector, with a fallback default."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

# Usage
session = create_session()

try:
    response = session.get("https://example.com", timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    title = safe_get_text(soup, "h1")
    price = safe_get_text(soup, ".price", default="Price not found")

except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e.response.status_code}")
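except requests.exceptions.ConnectionError:
    print("Could not connect to the server")
except requests.exceptions.RequestException as e:
    # Catch-all for everything else requests can raise, including RetryError
    # when the retries on 5xx responses are exhausted
    print(f"Request failed: {e}")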
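To make the backoff idea concrete, here is a hand-rolled retry loop. It is not how urllib3's Retry is implemented; it is a simplified stand-in with a hypothetical name (call_with_backoff) that waits base_delay * 2**n seconds after the n-th failure. The sleep function is injectable so the demo runs instantly.

```python
import time

def call_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # in real code, narrow this to requests.exceptions.RequestException
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a fake fetch that fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"

waits = []  # records the requested delays instead of actually sleeping
result = call_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)  # <html>ok</html> [1.0, 2.0]
```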

Respecting robots.txt

Before scraping any site, check its robots.txt file. When a site publishes one, it lives at the root of the host (for example https://example.com/robots.txt) and specifies which paths are off-limits for bots.

import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url):
    """Check if robots.txt permits scraping this URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    return rp.can_fetch("*", url)

print(is_allowed("https://books.toscrape.com/"))  # True

Ignoring robots.txt is considered impolite and can have legal implications depending on your jurisdiction and the site's Terms of Service.
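The same parser can be fed rules directly, which is handy for seeing how it interprets them without any network access. The rules and the BookScraper agent name below are made up for illustration; crawl_delay() reports a Crawl-delay directive if the file contains one.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(rules)  # parse() takes an iterable of lines, so no request is made

print(rp.can_fetch("BookScraper", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("BookScraper", "https://example.com/private/report.html"))    # False
print(rp.crawl_delay("BookScraper"))                                             # 2
```

When a crawl delay is present, it is a reasonable value to use for the pause between requests.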


Exporting Data

To CSV

df.to_csv("output.csv", index=False, encoding="utf-8-sig")
# utf-8-sig adds a BOM that makes Excel read accented characters correctly

To JSON

df.to_json("output.json", orient="records", indent=2, force_ascii=False)

To SQLite

import sqlite3

conn = sqlite3.connect("scraping_results.db")
df.to_sql("books", conn, if_exists="replace", index=False)
conn.close()
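Note that if_exists="replace" rebuilds the table on every run. For a scraper that runs repeatedly and should keep earlier rows, one alternative is a unique key plus INSERT OR IGNORE using the standard sqlite3 module. This is a sketch under assumptions: the table layout is invented, and using the title as the key only works if titles are unique on the site being scraped.

```python
import sqlite3

books = [
    {"title": "A Light in the Attic", "price": "£51.77"},
    {"title": "Tipping the Velvet", "price": "£53.74"},
]

# In-memory database for the demo; use a file path such as "scraping_results.db" in practice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS books (title TEXT PRIMARY KEY, price TEXT)")

def save(rows):
    """Insert rows, skipping titles that are already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO books (title, price) VALUES (:title, :price)", rows
    )
    conn.commit()

save(books)
save(books)  # a second run adds nothing

count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(count)  # 2
conn.close()
```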

A Complete, Production-Ready Scraper

Here is a fuller version that combines the retrying session, pagination, parsing, and CSV export from the sections above:

import requests
import pandas as pd
import time
import logging
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

class BookScraper:
    BASE_URL = "https://books.toscrape.com/catalogue/page-1.html"
    HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"}
    DELAY = 1.5  # seconds between requests

    RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def __init__(self):
        self.session = self._create_session()

    def _create_session(self):
        session = requests.Session()
        session.headers.update(self.HEADERS)
        retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
        adapter = HTTPAdapter(max_retries=retries)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session

    def _fetch(self, url):
        response = self.session.get(url, timeout=15)
        response.raise_for_status()
        # Pass bytes so BeautifulSoup can detect the encoding the page declares
        return BeautifulSoup(response.content, "lxml")

    def _parse_book(self, article):
        title = article.select_one("h3 a").get("title", "Unknown")
        price = article.select_one(".price_color").get_text(strip=True)
        rating_word = article.select_one(".star-rating")["class"][-1]
        rating = self.RATING_MAP.get(rating_word, 0)
        in_stock = "In stock" in article.select_one(".availability").text
        return {"title": title, "price": price, "rating": rating, "in_stock": in_stock}

    def scrape(self):
        all_books = []
        current_url = self.BASE_URL

        while current_url:
            logger.info(f"Scraping: {current_url}")
            soup = self._fetch(current_url)

            for article in soup.select("article.product_pod"):
                all_books.append(self._parse_book(article))

            next_btn = soup.select_one("li.next a")
            current_url = urljoin(current_url, next_btn["href"]) if next_btn else None
            time.sleep(self.DELAY)

        logger.info(f"Done. Scraped {len(all_books)} books.")
        return pd.DataFrame(all_books)

if __name__ == "__main__":
    scraper = BookScraper()
    df = scraper.scrape()
    df.to_csv("all_books.csv", index=False)
    print(df.describe())

Common Pitfalls and How to Avoid Them

1. Missing User-Agent: Some servers return a 403 or a bot-detection page if no User-Agent header is set. Always send one, either a browser-like string or a descriptive one that identifies your scraper.

2. Not handling missing elements: If a single product is missing its price tag, find() or select_one() returns None, and calling .text on None raises an AttributeError that stops the whole run. Check for None before reading the text, or use the safe_get_text() helper pattern shown earlier.

3. Scraping too fast: Without delays, you can overwhelm small servers or get IP-banned. A delay of 1–2 seconds between requests is a reasonable default. For large jobs, use asyncio (covered in Blog 03).

4. Ignoring encoding: Some sites serve Latin-1 or Windows-1252 encoded pages, and some omit the charset from their headers, so requests guesses wrong. If your text looks garbled, check response.encoding and set it to the page's actual encoding before reading response.text, for example response.encoding = response.apparent_encoding.

5. Parsing JavaScript-rendered content: BeautifulSoup only parses the raw HTML sent by the server. If the data you need is loaded by JavaScript after the page loads, BeautifulSoup cannot see it. You need Selenium or Playwright for those cases (covered in Blog 02).
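The encoding pitfall is easy to reproduce with nothing but strings. The snippet below shows what happens when UTF-8 bytes are decoded with the wrong charset, which is the typical cause of a stray "Â" in front of a pound sign. It is a standalone illustration and does not involve requests itself.

```python
raw = "£51.77".encode("utf-8")  # the bytes a UTF-8 page actually sends

wrong = raw.decode("latin-1")   # what a wrong charset guess produces
right = raw.decode("utf-8")

print(wrong)  # Â£51.77
print(right)  # £51.77

# Mis-decoded text can be repaired as long as it has not been altered since.
print(wrong.encode("latin-1").decode("utf-8"))  # £51.77
```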


What to Learn Next

You now have a solid foundation in synchronous scraping with BeautifulSoup and requests. The natural next steps are:

  • Dynamic pages (JavaScript-rendered): Move to Playwright or Selenium when the data is loaded by JS

  • Async scraping: Use httpx + asyncio to fetch many pages concurrently instead of one at a time (Blog 03)

  • Anti-bot evasion: Learn how sites detect scrapers and how to avoid detection (Blog 04)

  • Production pipelines: Use Scrapy for large-scale, fault-tolerant crawling (Blog 05)


Summary

Concept          What you learned
HTTP basics      requests.get(), status codes, headers
Parsing          BeautifulSoup, find(), find_all(), select()
Navigation       CSS selectors, attribute extraction, text extraction
Pagination       Following next-page links dynamically
Error handling   Retry sessions, safe element access
Data export      CSV, JSON, SQLite via pandas
Ethics           robots.txt, rate limiting, Terms of Service

Web scraping is a superpower. Use it responsibly.
