Async Scraping with Python: httpx & asyncio — Build a 10x Faster Scraper

Stop waiting for requests to finish one by one. This guide shows you how to scrape 1,000 pages in under 80 seconds using httpx and asyncio, with semaphore throttling and exponential backoff.

You've mastered the basics with requests and BeautifulSoup. Now imagine you need to scrape 5,000 product pages. With synchronous scraping, each request waits for the previous one to finish. At 1.5 seconds per page, that's over two hours of runtime.

Async scraping changes the equation entirely. Instead of waiting idle while the server thinks, Python fires off dozens of requests simultaneously — and your 5,000-page scrape finishes in under ten minutes.
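
The arithmetic behind those two figures can be checked with a short sketch. The helper name estimate_runtime and the concurrency of 15 are assumptions made for illustration (15 is the limit used later in this guide). The model ignores retries, politeness delays, and server slowdown, so read the async figure as a best case rather than a promise.

```python
import math

def estimate_runtime(pages: int, seconds_per_page: float, concurrency: int = 1) -> float:
    """Idealized wall-clock estimate: pages are handled in waves of `concurrency`."""
    waves = math.ceil(pages / concurrency)
    return waves * seconds_per_page

sync_seconds = estimate_runtime(5000, 1.5)        # one request at a time
async_seconds = estimate_runtime(5000, 1.5, 15)   # assumed: 15 requests in flight

print(f"sync:  {sync_seconds / 3600:.2f} hours")
print(f"async: {async_seconds / 60:.1f} minutes")
```

With these assumptions the synchronous run takes 7,500 seconds (just over two hours) and the concurrent run takes 501 seconds, which is under ten minutes.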

This isn't just a minor optimization. For any scraping job with more than a few hundred pages, async is the difference between a script you run overnight and one you run before your coffee cools.

By the end of this guide, you will:

Understand the difference between synchronous, threaded, and async I/O
Build a production-grade async scraper with httpx and asyncio
Control concurrency with semaphores to avoid getting banned
Handle errors, retries, and progress tracking
Save results safely from concurrent tasks

Why Not Just Use Threads?

A fair question. Python's threading module can also parallelize requests. So why asyncio?

Threading problems:

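Python's Global Interpreter Lock (GIL) lets only one thread run Python bytecode at a time, although it is released while a thread waits on the network
Each thread reserves its own stack (commonly up to 8 MB of virtual memory on Linux), so hundreds of threads add real memory and scheduling overhead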
Race conditions are subtle and hard to debug
Thread management overhead adds up

Asyncio advantages:

A single thread handles thousands of concurrent operations
Memory footprint is tiny — a coroutine costs ~1KB
No GIL contention for I/O-bound tasks
Explicit control flow with await — easier to reason about

For I/O-bound work (which scraping is, since most of the time goes to waiting for network responses), asyncio usually matches or beats threading on throughput while using far less memory at high concurrency.
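
The single-thread claim is easy to check with simulated requests. This is a minimal sketch, not a benchmark: fake_request and run_many are hypothetical names, and asyncio.sleep stands in for network latency.

```python
import asyncio
import threading
import time

async def fake_request(delay: float) -> int:
    """Stand-in for an HTTP call: waits without blocking the thread."""
    await asyncio.sleep(delay)
    return threading.get_ident()  # id of the thread that ran this coroutine

async def run_many(n: int, delay: float) -> tuple[set[int], float]:
    """Run n simulated requests concurrently; return thread ids seen and elapsed time."""
    start = time.perf_counter()
    thread_ids = await asyncio.gather(*(fake_request(delay) for _ in range(n)))
    return set(thread_ids), time.perf_counter() - start

if __name__ == "__main__":
    ids, elapsed = asyncio.run(run_many(1000, 0.1))
    print(f"{len(ids)} thread(s), {elapsed:.2f}s for 1000 simulated requests")
```

Every coroutine reports the same thread id, and the total time stays close to a single delay instead of growing with the number of requests.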

Rule of thumb: Use asyncio for I/O-bound tasks (HTTP, file reads, databases). Use multiprocessing for CPU-bound tasks (parsing large HTML, image processing).

How Async I/O Works

The key mental model: instead of blocking while waiting for a response, an async function yields control back to the event loop, which runs other tasks in the meantime.

Sync:    [req1 wait...] → [req2 wait...] → [req3 wait...]
          ← 4.5 seconds total →

Async:   [req1]→→→[req2]→→→[req3]
              ←responses arrive→
          ← ~1.5 seconds total →
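
The same picture can be reproduced without touching the network. In this sketch (hypothetical names, with asyncio.sleep standing in for the server's response time) each simulated request records when it starts and when it finishes.

```python
import asyncio

async def fake_request(name: str, delay: float, log: list[str]) -> None:
    log.append(f"start {name}")
    await asyncio.sleep(delay)   # control returns to the event loop here
    log.append(f"done {name}")

async def demo() -> list[str]:
    log: list[str] = []
    await asyncio.gather(
        fake_request("req1", 0.3, log),
        fake_request("req2", 0.1, log),
        fake_request("req3", 0.2, log),
    )
    return log

if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

All three "start" lines are logged before any "done" line, and the requests finish in order of their delay (req2, req3, req1) rather than in the order they were launched.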

Python's asyncio module provides the event loop. httpx is an HTTP client with a requests-style API that ships both a synchronous Client and a native AsyncClient; requests itself is synchronous only.

Installation

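pip install "httpx[http2]" beautifulsoup4 lxml aiofiles tqdm

The [http2] extra installs HTTP/2 support, which can further speed up scraping on sites that support it. HTTP/2 is still off by default: pass http2=True when creating the client to turn it on. lxml is the parser used by the BeautifulSoup examples below.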

The Async Basics

Before building the full scraper, let's understand the core async primitives:

import asyncio
import httpx

# An async function is defined with "async def"
# Calling it returns a coroutine object, not a result

async def fetch_page(url: str) -> str:
    async with httpx.AsyncClient() as client:
        # "await" suspends this coroutine until the response arrives
        # Other coroutines can run during this wait
        response = await client.get(url)
        return response.text

# asyncio.run() starts the event loop and runs one coroutine
html = asyncio.run(fetch_page("https://books.toscrape.com/"))
print(len(html))

Running multiple coroutines at once with asyncio.gather():

async def main():
    urls = [
        "https://books.toscrape.com/catalogue/page-1.html",
        "https://books.toscrape.com/catalogue/page-2.html",
        "https://books.toscrape.com/catalogue/page-3.html",
    ]

    async with httpx.AsyncClient() as client:
        # gather() runs all coroutines concurrently
        results = await asyncio.gather(*[
            client.get(url) for url in urls
        ])

    for r in results:
        print(r.status_code, r.url)

asyncio.run(main())

All three requests fire nearly simultaneously. The total time is approximately the time of the slowest single request — not the sum of all three.
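
Two details of asyncio.gather() are worth seeing in isolation: results come back in argument order, and by default the first exception propagates out of gather() so the other results are never returned. The sketch below uses a hypothetical fake_get in place of client.get and passes return_exceptions=True so that one failure does not discard the rest.

```python
import asyncio

async def fake_get(url: str, delay: float, fail: bool = False) -> str:
    """Simulated request; raises to imitate a network failure."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"could not reach {url}")
    return f"body of {url}"

async def fetch_all() -> list:
    # Results are returned in argument order, not completion order.
    return await asyncio.gather(
        fake_get("page-1", 0.2),
        fake_get("page-2", 0.1, fail=True),
        fake_get("page-3", 0.05),
        return_exceptions=True,   # failures are returned as values, not raised
    )

if __name__ == "__main__":
    for result in asyncio.run(fetch_all()):
        status = "FAILED" if isinstance(result, Exception) else "ok"
        print(status, result)
```

The caller then decides what to do with each exception object, for example logging it or queueing the URL for a retry.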

Controlling Concurrency with Semaphores

Firing 5,000 requests simultaneously is a quick way to get your IP banned, and it can exhaust sockets or memory on your own machine. asyncio.Semaphore sets a ceiling on the number of concurrent requests:

import asyncio
import httpx

# Only 15 requests can be in-flight at any given moment
SEMAPHORE = asyncio.Semaphore(15)

async def fetch_with_limit(client: httpx.AsyncClient, url: str) -> tuple[str, str | None]:
    """Fetch a URL, respecting the concurrency limit."""
    async with SEMAPHORE:  # Waits here if 15 requests are already running; other tasks keep going
        try:
            response = await client.get(url, timeout=15.0)
            response.raise_for_status()
            return url, response.text
        except httpx.HTTPError as e:  # base class; also covers httpx.TimeoutException
            print(f"Failed: {url} — {type(e).__name__}")
            return url, None
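
To confirm that the semaphore really caps the number of requests in flight, the sketch below counts concurrent workers directly. The names worker and measure_peak are made up for this illustration, and asyncio.sleep stands in for the HTTP call.

```python
import asyncio

async def worker(sem: asyncio.Semaphore, state: dict) -> None:
    async with sem:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)      # simulated request
        state["in_flight"] -= 1

async def measure_peak(n_tasks: int, limit: int) -> int:
    # Created inside the coroutine, so it belongs to the running event loop.
    sem = asyncio.Semaphore(limit)
    state = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(n_tasks)))
    return state["peak"]

if __name__ == "__main__":
    print("peak concurrency:", asyncio.run(measure_peak(200, 15)))
```

Even with 200 tasks scheduled at once, the peak never rises above the limit passed to the semaphore.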

Choosing the right semaphore value:

Target site	Safe concurrency
Small blogs / personal sites	3–5
Mid-size e-commerce	8–15
Large CDN-backed sites	20–50
APIs with explicit rate limits	Calculate from the limit (e.g. 100 req/min ≈ 1.7 req/sec)

Start conservative and increase only if you're not seeing 429 responses.
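
For the rate-limited API row, one way to turn a published limit into a semaphore value is the approximation throughput ≈ concurrency / average latency. That rule of thumb (Little's law) is an addition here, not something the table states, and the function names are hypothetical. A semaphore caps requests in flight, not requests per second, so treat the result as a starting point.

```python
import math

def concurrency_for_rate_limit(requests_per_minute: float, avg_latency_s: float) -> int:
    """Largest in-flight count whose steady-state throughput stays under the limit."""
    per_second = requests_per_minute / 60
    return max(1, math.floor(per_second * avg_latency_s))

def min_interval(requests_per_minute: float) -> float:
    """Seconds to wait between request starts when sending strictly one at a time."""
    return 60 / requests_per_minute

print(concurrency_for_rate_limit(100, 1.5))   # 100 req/min, 1.5 s per response
print(min_interval(100))
```

With a 100 req/min limit and 1.5-second responses, this suggests a concurrency of 2, or one request every 0.6 seconds if requests are sent one at a time.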

Adding Retries with Exponential Backoff

Network errors and transient failures are inevitable. A robust scraper retries failed requests with increasing delays:

import asyncio
import httpx
import random

SEMAPHORE = asyncio.Semaphore(15)

async def fetch_with_retry(
    client: httpx.AsyncClient,
    url: str,
    max_retries: int = 3
) -> tuple[str, str | None]:
    """Fetch with exponential backoff retries."""

    async with SEMAPHORE:
        for attempt in range(max_retries):
            try:
                # Add a small random delay to avoid synchronized bursts
                await asyncio.sleep(random.uniform(0.1, 0.5))

                response = await client.get(url, timeout=20.0)

                if response.status_code == 429:
                    # Rate limited — wait longer and retry
                    wait_time = (2 ** attempt) + random.uniform(0, 1)
                    print(f"Rate limited on {url}. Waiting {wait_time:.1f}s...")
                    await asyncio.sleep(wait_time)
                    continue

                response.raise_for_status()
                return url, response.text

            except httpx.TimeoutException:
                wait = 2 ** attempt
                print(f"Timeout on {url} (attempt {attempt+1}). Retrying in {wait}s...")
                await asyncio.sleep(wait)

            except httpx.HTTPStatusError as e:
                if e.response.status_code < 500:
                    # 4xx errors won't be fixed by retrying
                    return url, None
                await asyncio.sleep(2 ** attempt)

            except httpx.RequestError:
                await asyncio.sleep(2 ** attempt)

        print(f"Permanently failed: {url}")
        return url, None
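
The delay schedule used above can be pulled out into a small helper, which makes it easy to inspect and test. This is a sketch: the name backoff_delays and the cap parameter are assumptions that the function above does not have. The cap simply keeps later retries from waiting unreasonably long.

```python
import random
from typing import Optional

def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 30.0,
    jitter: float = 1.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return the wait before each retry: base * 2**attempt, capped, plus random jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)          # 1, 2, 4, ... up to cap
        delays.append(delay + rng.uniform(0, jitter))  # jitter de-synchronizes clients
    return delays

print(backoff_delays(3, jitter=0.0))
print(backoff_delays(6, cap=10.0, jitter=0.0))
```

With jitter disabled, three retries wait 1, 2, and 4 seconds, matching the 2 ** attempt expression in the retry loop.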

Progress Tracking with tqdm

Async scrapers are harder to monitor because tasks finish out of order. tqdm solves this:

import asyncio
import httpx
from tqdm.asyncio import tqdm

async def scrape_all_with_progress(urls: list[str]) -> dict[str, str]:
    """Scrape all URLs with a live progress bar. Failed pages are left out of the result."""

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; AsyncScraper/1.0)"
    }

    limits = httpx.Limits(
        max_keepalive_connections=20,
        max_connections=30
    )

    async with httpx.AsyncClient(headers=headers, limits=limits) as client:
        tasks = [fetch_with_retry(client, url) for url in urls]

        # tqdm.gather() shows progress as tasks complete
        results = await tqdm.gather(*tasks, desc="Scraping", unit="page")

    return {url: html for url, html in results if html}
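
If you would rather not add a dependency, the standard library's asyncio.as_completed() gives a rough equivalent: it hands back results in the order tasks finish, so a simple counter can report progress. The sketch below is an alternative to tqdm, not how tqdm works internally, and fake_fetch is a stand-in for the real request.

```python
import asyncio

async def fake_fetch(url: str, delay: float) -> str:
    await asyncio.sleep(delay)   # simulated network latency
    return url

async def scrape_with_counter(jobs: dict[str, float]) -> list[str]:
    """Run all jobs concurrently and print a counter as each one finishes."""
    tasks = [asyncio.create_task(fake_fetch(url, delay)) for url, delay in jobs.items()]
    finished = []
    for done, future in enumerate(asyncio.as_completed(tasks), start=1):
        url = await future       # next task to complete, whichever it is
        finished.append(url)
        print(f"[{done}/{len(tasks)}] {url}")
    return finished

if __name__ == "__main__":
    asyncio.run(scrape_with_counter({"page-1": 0.3, "page-2": 0.1, "page-3": 0.2}))
```

Unlike gather(), this loop sees results in completion order, so keep the URL with each result if the original order matters.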

Parsing Results Concurrently

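Parsing HTML is CPU-bound, but BeautifulSoup is fast enough on individual pages that calling it directly inside async functions is usually fine. For heavier parsing, offload it to a thread pool so the event loop stays responsive. Because of the GIL, threads will not parallelize the parsing itself; switch to a process pool if parsing becomes the bottleneck.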

import asyncio
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

THREAD_POOL = ThreadPoolExecutor(max_workers=4)

def parse_book_page(html: str) -> dict:
    """Parse a single book detail page (runs in thread pool)."""
    soup = BeautifulSoup(html, "lxml")

    return {
        "title": soup.select_one("h1").get_text(strip=True) if soup.select_one("h1") else None,
        "price": soup.select_one(".price_color").get_text(strip=True) if soup.select_one(".price_color") else None,
        "description": soup.select_one("#product_description ~ p").get_text(strip=True)
                        if soup.select_one("#product_description ~ p") else None,
        "upc": soup.select_one("table td").get_text(strip=True) if soup.select_one("table td") else None,
    }

async def parse_async(html: str) -> dict:
    """Run CPU-bound parsing in a thread pool without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(THREAD_POOL, parse_book_page, html)
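
The benefit of run_in_executor is that the event loop keeps servicing other tasks while the blocking call runs. The sketch below makes that visible with a heartbeat task. slow_parse is a hypothetical stand-in that blocks with time.sleep, which releases the GIL; a real pure-Python parser would compete for the GIL, so the heartbeat would still tick but less regularly.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def slow_parse(html: str) -> int:
    """Stand-in for a heavy parser: blocks its thread for a while."""
    time.sleep(0.3)
    return len(html)

async def heartbeat(ticks: list[float], stop: asyncio.Event) -> None:
    """Record a timestamp every 20 ms for as long as the event loop is free to run."""
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.02)

async def demo() -> tuple[int, int]:
    loop = asyncio.get_running_loop()
    ticks: list[float] = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    with ThreadPoolExecutor(max_workers=1) as pool:
        size = await loop.run_in_executor(pool, slow_parse, "<html></html>")
    stop.set()
    await beat
    return size, len(ticks)

if __name__ == "__main__":
    size, n_ticks = asyncio.run(demo())
    print(f"parsed {size} chars, heartbeat ticked {n_ticks} times meanwhile")
```

Calling slow_parse directly inside demo() instead would freeze the loop for the full 0.3 seconds and the heartbeat would not tick at all during the parse.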

Saving Results Safely with aiofiles

Writing to files from multiple concurrent tasks can cause data corruption if not handled carefully. Use a queue pattern to serialize writes:

import asyncio
import json
import aiofiles

async def result_writer(queue: asyncio.Queue, output_file: str):
    """Consume results from the queue and write them to a file."""
    async with aiofiles.open(output_file, "w", encoding="utf-8") as f:
        await f.write("[\n")
        first = True
        while True:
            item = await queue.get()
            if item is None:  # Sentinel value — signals completion
                break
            if not first:
                await f.write(",\n")
            await f.write(json.dumps(item, ensure_ascii=False))
            first = False
        await f.write("\n]")

async def scrape_and_save(urls: list[str], output_file: str):
    write_queue = asyncio.Queue()

    # Start the writer coroutine
    writer_task = asyncio.create_task(result_writer(write_queue, output_file))

    # Scrape and parse
    pages = await scrape_all_with_progress(urls)
    for url, html in pages.items():
        if html:
            parsed = await parse_async(html)
            parsed["source_url"] = url
            await write_queue.put(parsed)

    # Signal the writer to stop
    await write_queue.put(None)
    await writer_task

    print(f"Results written to {output_file}")
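
The queue pattern matters most when several tasks produce results at the same time. The sketch below shows that case with three producers and one writer. To stay dependency-free it uses plain file I/O instead of aiofiles and writes JSON Lines (one object per line) rather than a JSON array; both are substitutions made for this example, and the function names are hypothetical.

```python
import asyncio
import json
import os
import tempfile

async def producer(name: str, count: int, queue: asyncio.Queue) -> None:
    for i in range(count):
        await asyncio.sleep(0)   # yield to the loop, as a real fetch would
        await queue.put({"producer": name, "item": i})

async def writer(queue: asyncio.Queue, path: str) -> int:
    """Only this coroutine touches the file, so writes can never interleave."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        while True:
            record = await queue.get()
            if record is None:   # sentinel: all producers are done
                break
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written

async def run(path: str) -> int:
    queue: asyncio.Queue = asyncio.Queue()
    writer_task = asyncio.create_task(writer(queue, path))
    await asyncio.gather(*(producer(f"p{n}", 5, queue) for n in range(3)))
    await queue.put(None)        # sent once, after every producer has finished
    return await writer_task

if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "results.jsonl")
    print(asyncio.run(run(out)), "records written to", out)
```

Sending the sentinel only after gather() returns guarantees the writer does not stop while a producer still has items to enqueue.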

Complete Production Scraper

Here is the full, production-ready async scraper combining every piece above:

import asyncio
import random
import json
import time
import aiofiles
import httpx
from bs4 import BeautifulSoup
from tqdm.asyncio import tqdm
from concurrent.futures import ThreadPoolExecutor

# ── Configuration ────────────────────────────────────────────
CONCURRENCY   = 15          # Max simultaneous requests
MAX_RETRIES   = 3           # Retry failed requests up to N times
REQUEST_DELAY = (0.1, 0.5)  # Random delay range per request (seconds)
TIMEOUT       = 20.0        # Per-request timeout in seconds
HEADERS       = {"User-Agent": "Mozilla/5.0 (compatible; AsyncScraper/1.0)"}

semaphore   = asyncio.Semaphore(CONCURRENCY)
thread_pool = ThreadPoolExecutor(max_workers=4)


# ── Fetching ─────────────────────────────────────────────────
async def fetch(client: httpx.AsyncClient, url: str) -> tuple[str, str | None]:
    async with semaphore:
        for attempt in range(MAX_RETRIES):
            try:
                await asyncio.sleep(random.uniform(*REQUEST_DELAY))
                r = await client.get(url, timeout=TIMEOUT)
                if r.status_code == 429:
                    await asyncio.sleep(2 ** attempt + random.random())
                    continue
                r.raise_for_status()
                return url, r.text
            except httpx.HTTPError:  # base class: covers timeouts, connection errors, and 4xx/5xx
                if attempt < MAX_RETRIES - 1:
                    await asyncio.sleep(2 ** attempt)
        return url, None


# ── Parsing ──────────────────────────────────────────────────
def parse_book(html: str) -> dict:
    """Parse a book detail page. Runs in thread pool."""
    soup = BeautifulSoup(html, "lxml")

    def safe(selector, default=None):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else default

    table = {
        row.select_one("th").text: row.select_one("td").text
        for row in soup.select("table tr")
    } if soup.select("table tr") else {}

    return {
        "title":       safe("h1"),
        "price":       safe(".price_color"),
        "rating":      (soup.select_one(".star-rating") or {}).get("class", ["", ""])[1],
        "availability":safe(".availability"),
        "description": safe("#product_description ~ p"),
        "upc":         table.get("UPC"),
        "num_reviews": table.get("Number of reviews"),
    }


# ── Discovery ─────────────────────────────────────────────────
async def get_all_book_urls(client: httpx.AsyncClient) -> list[str]:
    """Crawl the catalogue index to collect all book detail URLs."""
    base = "https://books.toscrape.com/catalogue/"
    urls = []
    page_url = "https://books.toscrape.com/catalogue/page-1.html"

    while page_url:
        _, html = await fetch(client, page_url)
        if not html:
            break
        soup = BeautifulSoup(html, "lxml")
        for a in soup.select("article.product_pod h3 a"):
            urls.append(base + a["href"].replace("../", ""))
        next_btn = soup.select_one("li.next a")
        page_url = base + next_btn["href"] if next_btn else None

    return urls


# ── Main pipeline ─────────────────────────────────────────────
async def main():
    start_time = time.time()

    limits = httpx.Limits(max_keepalive_connections=20, max_connections=30)

    async with httpx.AsyncClient(headers=HEADERS, limits=limits) as client:
        print("Discovering book URLs...")
        book_urls = await get_all_book_urls(client)
        print(f"Found {len(book_urls)} books. Starting async scrape...")

        tasks = [fetch(client, url) for url in book_urls]
        fetch_results = await tqdm.gather(*tasks, desc="Fetching", unit="page")

    # Parse concurrently in thread pool
    print("Parsing results...")
    loop = asyncio.get_running_loop()
    all_books = []

    parse_tasks = [
        loop.run_in_executor(thread_pool, parse_book, html)
        for _, html in fetch_results if html
    ]
    parsed = await tqdm.gather(*parse_tasks, desc="Parsing", unit="page")
    all_books = [p for p in parsed if p]

    # Save
    # Explicit UTF-8: prices contain "£" and ensure_ascii=False writes it unescaped
    async with aiofiles.open("all_books_async.json", "w", encoding="utf-8") as f:
        await f.write(json.dumps(all_books, indent=2, ensure_ascii=False))

    elapsed = time.time() - start_time
    print(f"\nDone! Scraped {len(all_books)} books in {elapsed:.1f} seconds.")
    if all_books:  # avoid ZeroDivisionError when nothing was scraped
        print(f"Average: {elapsed/len(all_books)*1000:.0f}ms per book")


if __name__ == "__main__":
    asyncio.run(main())

Expected output:

Discovering book URLs...
Found 1000 books. Starting async scrape...
Fetching: 100%|████████████████| 1000/1000 [01:12<00:00, 13.8 page/s]
Parsing: 100%|███████████████| 1000/1000 [00:04<00:00, 231.4 page/s]

Done! Scraped 1000 books in 76.3 seconds.
Average: 76ms per book

Compare that with synchronous scraping at 1.5s/page: 1,500 seconds vs 76 seconds, roughly a 20x speedup.

Async vs Sync: Benchmark Comparison

Here's a real benchmark scraping 100 pages:

Method	Total Time	Memory Usage	CPU Usage
Synchronous (requests)	~150 sec	~45 MB	~5%
Threading (50 threads)	~18 sec	~430 MB	~35%
Asyncio (semaphore=15)	~12 sec	~52 MB	~8%
Asyncio (semaphore=50)	~6 sec	~58 MB	~12%

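Asyncio is the fastest option in this table and uses a fraction of the memory that threading does, at only a few megabytes more than the synchronous version.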
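
To see how the concurrency limit alone shapes the total time, a small harness with simulated latency is enough. The sketch below does not reproduce the figures in the table: the page count, the 20 ms latency, and the function names are all assumptions chosen so that it runs in a couple of seconds.

```python
import asyncio
import time

LATENCY = 0.02   # assumed per-request latency in seconds
PAGES = 60       # assumed number of simulated pages

async def fake_request(sem: asyncio.Semaphore) -> None:
    async with sem:
        await asyncio.sleep(LATENCY)

async def timed_run(pages: int, limit: int) -> float:
    """Time `pages` simulated requests with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(sem) for _ in range(pages)))
    return time.perf_counter() - start

async def compare(limits: tuple[int, ...] = (1, 15, 50)) -> dict[int, float]:
    return {limit: await timed_run(PAGES, limit) for limit in limits}

if __name__ == "__main__":
    for limit, seconds in asyncio.run(compare()).items():
        print(f"limit={limit:>2}: {seconds:.2f}s")
```

A limit of 1 behaves like a synchronous loop, and raising the limit shrinks the total toward the latency of a single request. Against a real site the curve flattens earlier, once the server or your bandwidth becomes the bottleneck.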

Combining Async Scraping with a Proxy Pool

For sites that rate-limit by IP, rotate proxies asynchronously:

import itertools

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

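async def fetch_with_proxy(url: str) -> tuple[str, str | None]:
    # httpx, HEADERS and TIMEOUT come from the complete scraper above
    proxy = next(proxy_cycle)
    # The proxy= argument needs a recent httpx release; older versions used proxies=
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS) as client:
        try:
            r = await client.get(url, timeout=TIMEOUT)
            r.raise_for_status()
            return url, r.text
        except httpx.HTTPError:  # timeouts, connection and proxy errors, 4xx/5xx
            return url, None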
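
The rotation itself is plain round-robin, which can be checked without any network access. In this sketch assign_proxies is a hypothetical helper that pairs each URL with its proxy up front, and the proxy addresses are placeholders.

```python
import itertools
from collections import Counter

def assign_proxies(urls: list[str], proxies: list[str]) -> list[tuple[str, str]]:
    """Pair each URL with the next proxy in round-robin order."""
    cycle = itertools.cycle(proxies)
    return [(url, next(cycle)) for url in urls]

urls = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 8)]
proxies = ["http://proxy1.example.com:8080",
           "http://proxy2.example.com:8080",
           "http://proxy3.example.com:8080"]

pairs = assign_proxies(urls, proxies)
load = Counter(proxy for _, proxy in pairs)
for proxy, count in load.items():
    print(proxy, count)
```

Seven URLs over three proxies gives a 3/2/2 split. One possible refinement is to keep a single long-lived AsyncClient per proxy and reuse it, so connections are pooled instead of being reopened for every request.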

When NOT to Use Async Scraping

Async scraping is powerful but not always the right tool:

JavaScript-rendered pages: httpx cannot execute JavaScript. Use Playwright (async mode) for those.
Sites with strict per-IP rate limits: More concurrency won't help if you're already at the IP limit. Use proxies instead.
Very small jobs (< 50 pages): The overhead of setting up async isn't worth it. Stick with synchronous requests.
CPU-heavy parsing per page: If each page takes 200ms to parse, the bottleneck is CPU, not I/O. Use multiprocessing instead.

Summary

Concept	What you learned
Why async	I/O-bound tasks benefit enormously from concurrency
httpx	Requests-style HTTP client with a native async API
asyncio.gather()	Run multiple coroutines concurrently
Semaphore	Cap concurrency to avoid bans
Exponential backoff	Gracefully handle rate limits and transient errors
Thread pool	Offload CPU-bound parsing without blocking the event loop
aiofiles	Safe async file writes
Benchmarks	10–20x speedup over synchronous scraping

