Web Scraping with Python: A Complete BeautifulSoup & Requests Guide
Learn how to build reliable web scrapers with Python, extract structured data from HTML, handle pagination and errors, and export results to CSV.
Senior Developer

Every day, billions of web pages sit on the internet — full of prices, headlines, job listings, research data, and more. Most of it has no official API. Web scraping is how you collect that data programmatically, turning raw HTML into clean, structured datasets you can actually use.
Python is one of the most widely used languages for web scraping. It has a rich ecosystem, readable syntax, and two libraries in particular that keep simple scrapers short: requests (for fetching web pages) and BeautifulSoup (for parsing them).
By the end of this guide, you will:
Understand how HTTP requests and HTML parsing work together
Write a scraper that collects data from real websites
Handle pagination, headers, and common errors
Export your data to CSV using pandas
Let's dig in.
How Web Scraping Works
When you type a URL into a browser, your browser sends an HTTP GET request to a server. The server responds with HTML. Your browser renders that HTML into the visual page you see.
Web scraping does the same thing — but instead of a browser rendering the HTML visually, Python reads it programmatically and extracts exactly the data you want.
Your Script → HTTP GET Request → Web Server
Web Server → HTML Response → Your Script
Your Script → Parse HTML → Structured Data

There are two key parts:
requests handles the first half: sending the HTTP request and receiving the HTML
BeautifulSoup handles the second half: parsing that HTML so you can navigate and extract from it
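To make the "parse HTML into structured data" half concrete without installing anything or touching the network, here is a minimal sketch that uses only the standard library's html.parser. The SAMPLE_HTML string and the LinkCollector class are hypothetical stand-ins for a real server response and for the work BeautifulSoup does for you; the rest of this guide uses BeautifulSoup instead.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the HTML a server would return.
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <a href="/book-1.html">First book</a>
  <a href="/book-2.html">Second book</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect a {"text", "href"} record for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"text": text, "href": self._href})
            self._href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

The result is a plain list of dictionaries, which is the same shape the later examples hand to pandas.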
Installation
Install all required libraries with a single pip command:
pip install requests beautifulsoup4 pandas lxml

Why lxml? BeautifulSoup supports multiple parsers. lxml is typically the fastest of them and tolerates malformed HTML well, which matters because real-world HTML is often messy.
Your First Scraper: Fetching a Page
Let's start simple. Here is how to fetch the HTML of any webpage:
import requests

url = "https://books.toscrape.com/"

# A User-Agent tells the server what kind of client is making the request.
# Without this, many servers will block or return a different response.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers, timeout=10)

# Always check the status code before parsing
print(response.status_code) # 200 = success
print(len(response.text)) # Length of the HTML string

About status codes:
| Code | Meaning |
|---|---|
| 200 | Success |
| 301/302 | Redirect (requests follows these automatically) |
| 403 | Forbidden: you're being blocked |
| 404 | Page not found |
| 429 | Too many requests: you're being rate-limited |
| 500 | Server error |
If you get a 403, your User-Agent is probably missing or being rejected. If you get a 429, you are scraping too fast.
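One way to act on these codes in a scraper is a small lookup helper. The sketch below is illustrative only: classify_status and its action labels ("parse", "check_headers", and so on) are hypothetical names, not part of requests, and the mapping simply mirrors the table above.

```python
def classify_status(code):
    """Return a coarse next action for an HTTP status code (labels are illustrative)."""
    if 200 <= code < 300:
        return "parse"          # success: safe to hand the body to the parser
    if code == 403:
        return "check_headers"  # likely blocked: review the User-Agent
    if code == 404:
        return "skip"           # page does not exist: move on
    if code == 429:
        return "slow_down"      # rate-limited: increase the delay
    if 500 <= code < 600:
        return "retry"          # server-side problem: try again later
    return "inspect"            # anything else deserves a manual look


for code in (200, 403, 404, 429, 500):
    print(code, classify_status(code))
```

In a real scraper you would pass response.status_code to a helper like this before deciding whether to parse, wait, or give up.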
Parsing HTML with BeautifulSoup
Once you have the HTML string, you pass it to BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "lxml")
# The `soup` object now represents the entire HTML document.
# You can navigate it like a tree.
print(soup.title.text) # Page title
print(soup.find("h1").text) # First h1 on the page

BeautifulSoup gives you several ways to find elements:
Method 1: find() — returns the first match
# Find the first element with tag <h2>
heading = soup.find("h2")
# Find the first element with a specific class
box = soup.find("div", class_="product-box")
# Find by ID
sidebar = soup.find("div", id="sidebar")

Method 2: find_all() returns a list of all matches
# Find ALL <a> tags
all_links = soup.find_all("a")
# Iterate and extract
for link in all_links:
    print(link.text, link.get("href"))

Method 3: CSS Selectors with select() (the most powerful)
If you know CSS, you already know this. .select() accepts any CSS selector string.
# All elements with class "product_pod"
products = soup.select("article.product_pod")
# All anchors inside elements with class "titleline" (select() returns a list)
title_links = soup.select(".titleline a")
# Nested selectors: p tags inside div.content
paragraphs = soup.select("div.content p")
# select_one() is like find() but uses CSS syntax
price = soup.select_one(".price_color")

Tip: Use your browser's DevTools to get selectors instantly. Right-click any element → Inspect → Right-click the highlighted HTML → Copy → Copy selector.
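The three lookup styles are easier to compare on a small document you control. The sketch below parses a hypothetical inline HTML string (SAMPLE_HTML is made up, loosely modelled on a product listing) and uses the built-in "html.parser" backend so it runs without lxml; it assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second product deliberately has no price.
SAMPLE_HTML = """
<div class="content">
  <article class="product_pod">
    <h3><a href="a.html" title="Book A">Book A</a></h3>
    <p class="price_color">£10.00</p>
  </article>
  <article class="product_pod">
    <h3><a href="b.html" title="Book B">Book B</a></h3>
  </article>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find(): first match only
first_title = soup.find("h3").a["title"]

# find_all(): every match, as a list
hrefs = [a.get("href") for a in soup.find_all("a")]

# select() / select_one(): CSS selectors, scoped to each product
products = soup.select("article.product_pod")
titles = [p.select_one("h3 a")["title"] for p in products]
prices = []
for p in products:
    tag = p.select_one(".price_color")  # None when the element is missing
    prices.append(tag.get_text(strip=True) if tag else None)

print(first_title, hrefs, titles, prices)
```

Note how select_one() returns None for the product without a price, which is why the loop checks the tag before reading its text.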
Real Example: Scraping Book Data
books.toscrape.com is a sandbox website built specifically for scraping practice. Let's scrape its catalog.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"
}

def parse_rating(class_string):
    """Convert word-based star rating to number."""
    rating_map = {
        "One": 1, "Two": 2, "Three": 3,
        "Four": 4, "Five": 5
    }
    # class_string looks like "star-rating Three"
    word = class_string.split()[-1]
    return rating_map.get(word, 0)

def scrape_page(url):
    """Scrape all books from a single catalogue page."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status() # raises exception on 4xx/5xx
    soup = BeautifulSoup(response.text, "lxml")
    books = []
    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price = article.select_one(".price_color").text.strip()
        rating_class = article.select_one(".star-rating")["class"]
        rating = parse_rating(" ".join(rating_class))
        in_stock = "In stock" in article.select_one(".availability").text
        books.append({
            "title": title,
            "price": price,
            "rating": rating,
            "in_stock": in_stock
        })
    return books

def scrape_catalog(pages=5):
    """Scrape multiple pages with polite delays."""
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    all_books = []
    for page_num in range(1, pages + 1):
        url = base_url.format(page_num)
        print(f"Scraping page {page_num}...")
        page_books = scrape_page(url)
        all_books.extend(page_books)
        time.sleep(1.5) # Be polite: don't hammer the server
    return pd.DataFrame(all_books)

# Run the scraper
df = scrape_catalog(pages=10)
print(f"Scraped {len(df)} books")
print(df.head())
# Save to CSV
df.to_csv("books.csv", index=False)

Sample output:
Scraped 200 books
title price rating in_stock
0 A Light in the Attic £51.77 3 True
1 Tipping the Velvet £53.74 1 True
2 Soumission £50.10 1 True
...

Handling Pagination Automatically
The previous example used hard-coded page numbers. A better approach is to follow "Next" links dynamically — this way your scraper adapts to any number of pages.
from urllib.parse import urljoin

# Reuses requests, BeautifulSoup, pd, time, and HEADERS from the previous example.
def scrape_all_pages(start_url):
    """Follow pagination links until there are no more pages."""
    all_books = []
    current_url = start_url
    while current_url:
        print(f"Scraping: {current_url}")
        response = requests.get(current_url, headers=HEADERS, timeout=10)
        response.raise_for_status() # stop on 4xx/5xx instead of parsing an error page
        soup = BeautifulSoup(response.text, "lxml")
        # Scrape current page
        for article in soup.select("article.product_pod"):
            title = article.select_one("h3 a")["title"]
            price = article.select_one(".price_color").text.strip()
            all_books.append({"title": title, "price": price})
        # Find the "next" button; select_one() returns None on the last page
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # Build the absolute URL from the relative href
            current_url = urljoin(current_url, next_btn["href"])
        else:
            current_url = None # No more pages, stop the loop
        time.sleep(1)
    return pd.DataFrame(all_books)

df = scrape_all_pages("https://books.toscrape.com/catalogue/page-1.html")
print(f"Total books scraped: {len(df)}")

This pattern works for most sites that expose a "next" link in the server-rendered HTML: product listings, news archives, search results. Usually only the selectors need to change.
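The step that most often goes wrong in this loop is turning the relative href of the "next" link into an absolute URL. The short sketch below shows how urljoin resolves a few kinds of href against the current page; the href values are illustrative examples, not taken from a live response.

```python
from urllib.parse import urljoin

current = "https://books.toscrape.com/catalogue/page-1.html"

# A bare filename replaces the last path segment of the current URL.
sibling = urljoin(current, "page-2.html")

# A leading slash resolves against the site root.
rooted = urljoin(current, "/index.html")

# ".." climbs one directory before resolving.
parent = urljoin(current, "../index.html")

# An absolute href is returned unchanged.
absolute = urljoin(current, "https://example.com/other.html")

for result in (sibling, rooted, parent, absolute):
    print(result)
```

Passing the current page URL as the base (rather than the start URL) keeps the resolution correct even if the site moves between directories while paginating.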
Extracting Common Data Types
Extracting text
# .text gives raw text including whitespace
raw = element.text
# .get_text(strip=True) is cleaner
clean = element.get_text(strip=True)
# .get_text(separator=", ") joins multiple text nodes
joined = element.get_text(separator=", ")

Extracting attributes
# Get the href from a link
url = soup.find("a")["href"]
url = soup.find("a").get("href") # safer — returns None instead of KeyError
# Get the src from an image
img_src = soup.find("img").get("src")
# Get a data attribute
product_id = element.get("data-product-id")

Extracting tables
HTML tables are tedious to parse manually. pandas does it in one line:
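from io import StringIO
import pandas as pd
# pd.read_html() returns a list of all tables on the page as DataFrames.
# Recent pandas versions expect a file-like object rather than a raw HTML string,
# so wrap the text in StringIO.
tables = pd.read_html(StringIO(response.text))
df = tables[0] # first table on the page
print(df)

Handling Errors Gracefully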
Real-world scraping always involves errors — network timeouts, missing elements, rate limiting. Here is a robust error-handling pattern:
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    """Create a session with automatic retries on network errors."""
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    # Retry up to 3 times on connection errors and 500/502/503/504
    retry_strategy = Retry(
        total=3,
        backoff_factor=1, # Exponential backoff: the wait roughly doubles on each retry
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def safe_get_text(element, selector, default="N/A"):
    """Extract text from a CSS selector, with a fallback default."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

# Usage
session = create_session()
try:
    response = session.get("https://example.com", timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    title = safe_get_text(soup, "h1")
    price = safe_get_text(soup, ".price", default="Price not found")
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e.response.status_code}")
except requests.exceptions.ConnectionError:
    print("Could not connect to the server")

Respecting robots.txt
Before scraping any site, check its robots.txt file. This file, always located at domain.com/robots.txt, specifies which paths are off-limits for bots.
import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url):
    """Check if robots.txt permits scraping this URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch("*", url)

print(is_allowed("https://books.toscrape.com/")) # True

Ignoring robots.txt is considered impolite and can have legal implications depending on your jurisdiction and the site's Terms of Service.
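To see how the rules are evaluated without making a network request, you can feed RobotFileParser the file's lines directly. The sketch below uses a hypothetical robots.txt (the ROBOTS_TXT string, the example.com URLs, and the "BookScraper" agent name are all made up for illustration) and also reads the Crawl-delay value, which is a reasonable input for your time.sleep() delay.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() accepts an iterable of lines

allowed = rp.can_fetch("BookScraper", "https://example.com/catalogue/page-1.html")
blocked = rp.can_fetch("BookScraper", "https://example.com/private/data.html")
delay = rp.crawl_delay("BookScraper")  # None when no Crawl-delay is declared

print(allowed, blocked, delay)
```

Because crawl_delay() can return None, a scraper would typically fall back to its own default delay in that case.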
Exporting Data
To CSV
df.to_csv("output.csv", index=False, encoding="utf-8-sig")
# utf-8-sig adds a BOM that makes Excel read accented characters correctly

To JSON
df.to_json("output.json", orient="records", indent=2, force_ascii=False)

To SQLite
import sqlite3
conn = sqlite3.connect("scraping_results.db")
df.to_sql("books", conn, if_exists="replace", index=False)
conn.close()

A Complete, Production-Ready Scraper
Here is the complete, polished version combining everything above:
import requests
import pandas as pd
import time
import logging
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

class BookScraper:
    BASE_URL = "https://books.toscrape.com/catalogue/page-1.html"
    HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"}
    DELAY = 1.5 # seconds between requests
    RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def __init__(self):
        self.session = self._create_session()

    def _create_session(self):
        session = requests.Session()
        session.headers.update(self.HEADERS)
        retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503])
        session.mount("https://", HTTPAdapter(max_retries=retries))
        return session

    def _fetch(self, url):
        response = self.session.get(url, timeout=15)
        response.raise_for_status()
        return BeautifulSoup(response.text, "lxml")

    def _parse_book(self, article):
        title = article.select_one("h3 a").get("title", "Unknown")
        price = article.select_one(".price_color").get_text(strip=True)
        rating_word = article.select_one(".star-rating")["class"][-1]
        rating = self.RATING_MAP.get(rating_word, 0)
        in_stock = "In stock" in article.select_one(".availability").text
        return {"title": title, "price": price, "rating": rating, "in_stock": in_stock}

    def scrape(self):
        all_books = []
        current_url = self.BASE_URL
        while current_url:
            logger.info(f"Scraping: {current_url}")
            soup = self._fetch(current_url)
            for article in soup.select("article.product_pod"):
                all_books.append(self._parse_book(article))
            next_btn = soup.select_one("li.next a")
            current_url = urljoin(current_url, next_btn["href"]) if next_btn else None
            time.sleep(self.DELAY)
        logger.info(f"Done. Scraped {len(all_books)} books.")
        return pd.DataFrame(all_books)

if __name__ == "__main__":
    scraper = BookScraper()
    df = scraper.scrape()
    df.to_csv("all_books.csv", index=False)
    print(df.describe())

Common Pitfalls and How to Avoid Them
1. Missing User-Agent: Many servers return a 403 or a bot-detection page if no User-Agent header is set. Always include one that mimics a real browser.
2. Not handling missing elements: If a single product is missing its price tag, calling .text on None raises an AttributeError and crashes your entire scraper. Check the result of find() or select_one() for None before calling .get_text(), or use the safe_get_text() helper pattern shown earlier.
3. Scraping too fast: Without delays, you can overwhelm small servers, get IP-banned, or cause real harm. A delay of 1–2 seconds between requests is standard practice. For large jobs, use asyncio (covered in Blog 03).
4. Ignoring encoding: Some sites serve Latin-1 or Windows-1252 encoded pages, or omit the charset from their headers. If your text looks garbled, check response.encoding and set it to the page's actual encoding before reading response.text, for example response.encoding = response.apparent_encoding or response.encoding = "utf-8".
5. Parsing JavaScript-rendered content: BeautifulSoup only parses the raw HTML sent by the server. If the data you need is loaded by JavaScript after the page renders, BeautifulSoup cannot see it. You need Selenium or Playwright for those cases (covered in Blog 02).
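The encoding pitfall is easy to reproduce offline. The sketch below is a standalone illustration using plain bytes, not requests itself: it decodes UTF-8 bytes with the wrong codec to show what the garbling looks like, then shows that decoding with the right codec gives clean text. The sample string is made up.

```python
# Bytes as a UTF-8 server would send them (sample text is illustrative).
raw = "Café £5".encode("utf-8")

# Decoding with the wrong codec produces the familiar garbled characters.
wrong = raw.decode("latin-1")

# Decoding with the codec the server actually used gives the original text.
right = raw.decode("utf-8")

# If you only have the garbled string, reversing the wrong step recovers it.
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong))
print(repr(right))
print(repaired == right)
```

With requests, setting response.encoding before reading response.text plays the same role as choosing the codec in raw.decode() here.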
What to Learn Next
You now have a solid foundation in synchronous scraping with BeautifulSoup and requests. The natural next steps are:
Dynamic pages (JavaScript-rendered): Move to Playwright or Selenium when the data is loaded by JS
Async scraping: Use httpx + asyncio to scrape 10x faster (Blog 03)
Anti-bot evasion: Learn how sites detect scrapers and how to avoid detection (Blog 04)
Production pipelines: Use Scrapy for large-scale, fault-tolerant crawling (Blog 05)
Summary
| Concept | What you learned |
|---|---|
| HTTP basics | requests.get(), status codes, headers |
| Parsing | BeautifulSoup, find(), find_all(), select() |
| Navigation | CSS selectors, attribute extraction, text extraction |
| Pagination | Following next-page links dynamically |
| Error handling | Retry sessions, safe element access |
| Data export | CSV, JSON, SQLite via pandas |
| Ethics | robots.txt, rate limiting, Terms of Service |
Web scraping is a superpower. Use it responsibly.