Async Scraping with Python: httpx & asyncio — Build a 10x Faster Scraper
Scrape 10x faster by replacing synchronous requests with async httpx and asyncio — with semaphore concurrency control, exponential backoff retries, and live progress bars.
Senior Developer

You've mastered the basics with requests and BeautifulSoup. Now imagine you need to scrape 5,000 product pages. With synchronous scraping, each request waits for the previous one to finish. At 1.5 seconds per page, that's over two hours of runtime.
Async scraping changes the equation entirely. Instead of waiting idle while the server thinks, Python fires off dozens of requests simultaneously — and your 5,000-page scrape finishes in under ten minutes.
This isn't just a minor optimization. For any scraping job with more than a few hundred pages, async is the difference between a script you run overnight and one you run before your coffee cools.
By the end of this guide, you will:
Understand the difference between synchronous, threaded, and async I/O
Build a production-grade async scraper with httpx and asyncio
Control concurrency with semaphores to avoid getting banned
Handle errors, retries, and progress tracking
Save results safely from concurrent tasks
Why Not Just Use Threads?
A fair question. Python's threading module can also parallelize requests. So why asyncio?
Threading problems:
Python's Global Interpreter Lock (GIL) limits true parallelism
Each thread reserves its own stack (about 8 MB of address space by default on Linux), so 500 threads can reserve roughly 4 GB
Race conditions are subtle and hard to debug
Thread management overhead adds up
Asyncio advantages:
A single thread handles thousands of concurrent operations
Memory footprint is tiny — a coroutine costs ~1KB
No GIL contention for I/O-bound tasks
Explicit control flow with await, which is easier to reason about
For I/O-bound work (which scraping is — you spend most time waiting for network responses), asyncio consistently outperforms threading.
Rule of thumb: Use asyncio for I/O-bound tasks (HTTP, file reads, databases). Use multiprocessing for CPU-bound tasks (parsing large HTML, image processing).
How Async I/O Works
The key mental model: instead of blocking while waiting for a response, an async function yields control back to the event loop, which runs other tasks in the meantime.
Sync:  [req1 wait...] → [req2 wait...] → [req3 wait...]
       ← 4.5 seconds total →
Async: [req1]→→→[req2]→→→[req3]
       ←responses arrive→
       ← ~1.5 seconds total →
Python's asyncio module provides the event loop. httpx is an HTTP client with a first-class async API (httpx.AsyncClient) alongside its synchronous one, unlike requests, which is purely synchronous.
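To see the scheduling effect without touching the network, here is a minimal sketch that uses asyncio.sleep() as a stand-in for a slow HTTP response. The name fake_request and the 0.2-second latency are made up for illustration; real timings depend on the server and the network.

```python
import asyncio
import time


async def fake_request(name: str, latency: float = 0.2) -> str:
    """Hypothetical stand-in for an HTTP call: waits, then returns."""
    # While this coroutine sleeps, the event loop is free to run others.
    await asyncio.sleep(latency)
    return name


async def run_sequential(names: list) -> float:
    """Await each request before starting the next one."""
    start = time.perf_counter()
    for name in names:
        await fake_request(name)
    return time.perf_counter() - start


async def run_concurrent(names: list) -> float:
    """Start every request, then wait for all of them together."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(name) for name in names))
    return time.perf_counter() - start


names = ["req1", "req2", "req3"]
sequential_time = asyncio.run(run_sequential(names))
concurrent_time = asyncio.run(run_concurrent(names))
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

The sequential run should take about the sum of the three waits, while the concurrent run should take about as long as a single wait.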
Installation
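pip install "httpx[http2]" beautifulsoup4 lxml aiofiles tqdm
The [http2] extra installs HTTP/2 support, which can further speed up scraping on sites that support it. Note that httpx only uses HTTP/2 when the client is created with http2=True. lxml is included because the parsing examples below use BeautifulSoup's "lxml" parser.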
The Async Basics
Before building the full scraper, let's understand the core async primitives:
import asyncio
import httpx

# An async function is defined with "async def"
# Calling it returns a coroutine object, not a result
async def fetch_page(url: str) -> str:
    async with httpx.AsyncClient() as client:
        # "await" suspends this coroutine until the response arrives
        # Other coroutines can run during this wait
        response = await client.get(url)
        return response.text

# asyncio.run() starts the event loop and runs one coroutine
html = asyncio.run(fetch_page("https://books.toscrape.com/"))
print(len(html))
Running multiple coroutines at once with asyncio.gather():
async def main():
    urls = [
        "https://books.toscrape.com/catalogue/page-1.html",
        "https://books.toscrape.com/catalogue/page-2.html",
        "https://books.toscrape.com/catalogue/page-3.html",
    ]
    async with httpx.AsyncClient() as client:
        # gather() runs all coroutines concurrently
        results = await asyncio.gather(*[
            client.get(url) for url in urls
        ])
    for r in results:
        print(r.status_code, r.url)

asyncio.run(main())
All three requests fire nearly simultaneously. The total time is approximately the time of the slowest single request, not the sum of all three.
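One caveat is worth sketching: by default, asyncio.gather() raises the first exception it encounters, so the caller never sees the results of the other coroutines. Passing return_exceptions=True returns each exception in place of a result, in the same order as the inputs. The example below uses a made-up flaky_fetch coroutine and placeholder URLs instead of real HTTP calls.

```python
import asyncio


async def flaky_fetch(url: str) -> str:
    """Hypothetical fetcher: fails for any URL containing 'bad'."""
    await asyncio.sleep(0.01)  # simulate network latency
    if "bad" in url:
        raise ConnectionError(f"could not reach {url}")
    return f"<html>{url}</html>"


async def fetch_all(urls: list) -> list:
    # return_exceptions=True: failures come back as values, in input order,
    # instead of aborting the whole gather() call.
    return await asyncio.gather(
        *(flaky_fetch(u) for u in urls), return_exceptions=True
    )


urls = [
    "https://example.com/ok-1",
    "https://example.com/bad",
    "https://example.com/ok-2",
]
results = asyncio.run(fetch_all(urls))
for url, result in zip(urls, results):
    status = "FAILED" if isinstance(result, Exception) else "ok"
    print(status, url)
```

Whether to collect exceptions this way or catch them inside each coroutine (as the later examples in this guide do) is a design choice; either keeps one bad URL from sinking the batch.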
Controlling Concurrency with Semaphores
Firing 5,000 requests simultaneously will get your IP banned instantly — and might actually crash your own machine. asyncio.Semaphore is the tool for setting a ceiling on concurrent requests:
import asyncio
import httpx

# Only 15 requests can be in-flight at any given moment
SEMAPHORE = asyncio.Semaphore(15)

async def fetch_with_limit(client: httpx.AsyncClient, url: str) -> tuple[str, str | None]:
    """Fetch a URL, respecting the concurrency limit."""
    # Waits here (without blocking the event loop) if 15 requests are already running
    async with SEMAPHORE:
        try:
            response = await client.get(url, timeout=15.0)
            response.raise_for_status()
            return url, response.text
        except (httpx.HTTPError, httpx.TimeoutException) as e:
            print(f"Failed: {url} ({type(e).__name__})")
            return url, None
Choosing the right semaphore value:
| Target site | Safe concurrency |
|---|---|
| Small blogs / personal sites | 3–5 |
| Mid-size e-commerce | 8–15 |
| Large CDN-backed sites | 20–50 |
| APIs with explicit rate limits | Calculate from the limit (e.g. 100 req/min ≈ 1.7 req/sec) |
Start conservative and increase only if you're not seeing 429 responses.
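A quick way to check that the ceiling holds is to count in-flight tasks. The sketch below replaces the HTTP call with asyncio.sleep() and records the peak; the names worker, run_demo, and min_interval are illustrative and not part of httpx or asyncio. The last helper turns an advertised rate limit into a minimum gap between request starts.

```python
import asyncio


async def worker(sem: asyncio.Semaphore, state: dict) -> None:
    """Simulated request that records how many workers run at once."""
    async with sem:  # waits here while `limit` workers are already inside
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for the network round trip
        state["in_flight"] -= 1


async def run_demo(total_tasks: int, limit: int) -> int:
    """Run `total_tasks` workers under a semaphore and return peak concurrency."""
    sem = asyncio.Semaphore(limit)
    state = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(total_tasks)))
    return state["peak"]


def min_interval(requests_per_minute: float) -> float:
    """Seconds to leave between request starts to stay under a rate limit."""
    return 60.0 / requests_per_minute


peak = asyncio.run(run_demo(total_tasks=100, limit=15))
print(f"peak concurrency: {peak}")
print(f"100 req/min -> one request every {min_interval(100):.2f}s")
```

Note that a semaphore caps how many requests overlap, not how many start per second; for an API with a hard per-minute quota, the interval from min_interval is the number to respect.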
Adding Retries with Exponential Backoff
Network errors and transient failures are inevitable. A robust scraper retries failed requests with increasing delays:
import asyncio
import httpx
import random

SEMAPHORE = asyncio.Semaphore(15)

async def fetch_with_retry(
    client: httpx.AsyncClient,
    url: str,
    max_retries: int = 3
) -> tuple[str, str | None]:
    """Fetch with exponential backoff retries."""
    async with SEMAPHORE:
        for attempt in range(max_retries):
            try:
                # Add a small random delay to avoid synchronized bursts
                await asyncio.sleep(random.uniform(0.1, 0.5))
                response = await client.get(url, timeout=20.0)
                if response.status_code == 429:
                    # Rate limited: wait longer and retry
                    wait_time = (2 ** attempt) + random.uniform(0, 1)
                    print(f"Rate limited on {url}. Waiting {wait_time:.1f}s...")
                    await asyncio.sleep(wait_time)
                    continue
                response.raise_for_status()
                return url, response.text
            except httpx.TimeoutException:
                wait = 2 ** attempt
                print(f"Timeout on {url} (attempt {attempt+1}). Retrying in {wait}s...")
                await asyncio.sleep(wait)
            except httpx.HTTPStatusError as e:
                if e.response.status_code < 500:
                    # 4xx errors won't be fixed by retrying
                    return url, None
                await asyncio.sleep(2 ** attempt)
            except httpx.RequestError:
                await asyncio.sleep(2 ** attempt)
        print(f"Permanently failed: {url}")
        return url, None
Progress Tracking with tqdm
Async scrapers are harder to monitor because tasks finish out of order. tqdm solves this:
import asyncio
import httpx
from tqdm.asyncio import tqdm

# Uses fetch_with_retry() from the previous section
async def scrape_all_with_progress(urls: list[str]) -> dict[str, str | None]:
    """Scrape all URLs with a live progress bar."""
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; AsyncScraper/1.0)"
    }
    limits = httpx.Limits(
        max_keepalive_connections=20,
        max_connections=30
    )
    async with httpx.AsyncClient(headers=headers, limits=limits) as client:
        tasks = [fetch_with_retry(client, url) for url in urls]
        # tqdm.gather() shows progress as tasks complete
        results = await tqdm.gather(*tasks, desc="Scraping", unit="page")
    return {url: html for url, html in results if html}
Parsing Results Concurrently
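Parsing HTML is CPU-bound, but BeautifulSoup is fast enough on individual pages that calling it directly inside async functions is usually fine. For heavier parsing, offload it to a thread pool so the event loop stays free to service network I/O. Because of the GIL, this keeps the loop responsive rather than parsing pages in parallel; a process pool is the option when parsing itself is the bottleneck: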
import asyncio
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

THREAD_POOL = ThreadPoolExecutor(max_workers=4)

def parse_book_page(html: str) -> dict:
    """Parse a single book detail page (runs in thread pool)."""
    soup = BeautifulSoup(html, "lxml")
    return {
        "title": soup.select_one("h1").get_text(strip=True) if soup.select_one("h1") else None,
        "price": soup.select_one(".price_color").get_text(strip=True) if soup.select_one(".price_color") else None,
        "description": soup.select_one("#product_description ~ p").get_text(strip=True)
            if soup.select_one("#product_description ~ p") else None,
        "upc": soup.select_one("table td").get_text(strip=True) if soup.select_one("table td") else None,
    }

async def parse_async(html: str) -> dict:
    """Run CPU-bound parsing in a thread pool without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(THREAD_POOL, parse_book_page, html)
Saving Results Safely with aiofiles
Writing to files from multiple concurrent tasks can cause data corruption if not handled carefully. Use a queue pattern to serialize writes:
import asyncio
import json
import aiofiles

async def result_writer(queue: asyncio.Queue, output_file: str):
    """Consume results from the queue and write them to a file."""
    async with aiofiles.open(output_file, "w", encoding="utf-8") as f:
        await f.write("[\n")
        first = True
        while True:
            item = await queue.get()
            if item is None:  # Sentinel value: signals completion
                break
            if not first:
                await f.write(",\n")
            await f.write(json.dumps(item, ensure_ascii=False))
            first = False
        await f.write("\n]")

# Uses scrape_all_with_progress() and parse_async() from the sections above
async def scrape_and_save(urls: list[str], output_file: str):
    write_queue = asyncio.Queue()
    # Start the writer coroutine
    writer_task = asyncio.create_task(result_writer(write_queue, output_file))
    # Scrape and parse
    pages = await scrape_all_with_progress(urls)
    for url, html in pages.items():
        if html:
            parsed = await parse_async(html)
            parsed["source_url"] = url
            await write_queue.put(parsed)
    # Signal the writer to stop
    await write_queue.put(None)
    await writer_task
    print(f"Results written to {output_file}")
Complete Production Scraper
Here is the full, production-ready async scraper combining every piece above:
import asyncio
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor

import aiofiles
import httpx
from bs4 import BeautifulSoup
from tqdm.asyncio import tqdm

# ── Configuration ────────────────────────────────────────────
CONCURRENCY = 15            # Max simultaneous requests
MAX_RETRIES = 3             # Retry failed requests up to N times
REQUEST_DELAY = (0.1, 0.5)  # Random delay range per request (seconds)
TIMEOUT = 20.0              # Per-request timeout in seconds
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; AsyncScraper/1.0)"}

# Module-level semaphore: assumes Python 3.10+, which the type hints below already require
semaphore = asyncio.Semaphore(CONCURRENCY)
thread_pool = ThreadPoolExecutor(max_workers=4)

# ── Fetching ─────────────────────────────────────────────────
async def fetch(client: httpx.AsyncClient, url: str) -> tuple[str, str | None]:
    """Fetch one URL with a concurrency cap, jittered delay, and backoff retries."""
    async with semaphore:
        for attempt in range(MAX_RETRIES):
            try:
                await asyncio.sleep(random.uniform(*REQUEST_DELAY))
                r = await client.get(url, timeout=TIMEOUT)
                if r.status_code == 429:
                    # Rate limited: back off with jitter, then retry
                    await asyncio.sleep(2 ** attempt + random.random())
                    continue
                r.raise_for_status()
                return url, r.text
            except (httpx.HTTPError, httpx.TimeoutException):
                if attempt < MAX_RETRIES - 1:
                    await asyncio.sleep(2 ** attempt)
        return url, None

# ── Parsing ──────────────────────────────────────────────────
def parse_book(html: str) -> dict:
    """Parse a book detail page. Runs in thread pool."""
    soup = BeautifulSoup(html, "lxml")

    def safe(selector, default=None):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else default

    # Product information table: {"UPC": "...", "Number of reviews": "...", ...}
    table = {
        row.select_one("th").text: row.select_one("td").text
        for row in soup.select("table tr")
    } if soup.select("table tr") else {}

    return {
        "title": safe("h1"),
        "price": safe(".price_color"),
        # The rating is the second CSS class, e.g. class="star-rating Three"
        "rating": (soup.select_one(".star-rating") or {}).get("class", ["", ""])[1],
        "availability": safe(".availability"),
        "description": safe("#product_description ~ p"),
        "upc": table.get("UPC"),
        "num_reviews": table.get("Number of reviews"),
    }

# ── Discovery ─────────────────────────────────────────────────
async def get_all_book_urls(client: httpx.AsyncClient) -> list[str]:
    """Crawl the catalogue index to collect all book detail URLs."""
    base = "https://books.toscrape.com/catalogue/"
    urls = []
    page_url = "https://books.toscrape.com/catalogue/page-1.html"
    while page_url:
        _, html = await fetch(client, page_url)
        if not html:
            break
        soup = BeautifulSoup(html, "lxml")
        for a in soup.select("article.product_pod h3 a"):
            urls.append(base + a["href"].replace("../", ""))
        next_btn = soup.select_one("li.next a")
        page_url = base + next_btn["href"] if next_btn else None
    return urls

# ── Main pipeline ─────────────────────────────────────────────
async def main():
    start_time = time.time()
    limits = httpx.Limits(max_keepalive_connections=20, max_connections=30)

    async with httpx.AsyncClient(headers=HEADERS, limits=limits) as client:
        print("Discovering book URLs...")
        book_urls = await get_all_book_urls(client)
        print(f"Found {len(book_urls)} books. Starting async scrape...")

        tasks = [fetch(client, url) for url in book_urls]
        fetch_results = await tqdm.gather(*tasks, desc="Fetching", unit="page")

    # Parse concurrently in thread pool
    print("Parsing results...")
    loop = asyncio.get_running_loop()
    parse_tasks = [
        loop.run_in_executor(thread_pool, parse_book, html)
        for _, html in fetch_results if html
    ]
    parsed = await tqdm.gather(*parse_tasks, desc="Parsing", unit="page")
    all_books = [p for p in parsed if p]

    # Save (explicit UTF-8, since ensure_ascii=False keeps characters like "£" as-is)
    async with aiofiles.open("all_books_async.json", "w", encoding="utf-8") as f:
        await f.write(json.dumps(all_books, indent=2, ensure_ascii=False))

    elapsed = time.time() - start_time
    print(f"\nDone! Scraped {len(all_books)} books in {elapsed:.1f} seconds.")
    if all_books:  # avoid dividing by zero when nothing was scraped
        print(f"Average: {elapsed/len(all_books)*1000:.0f}ms per book")

if __name__ == "__main__":
    asyncio.run(main())
Expected output:
Discovering book URLs...
Found 1000 books. Starting async scrape...
Fetching: 100%|████████████████| 1000/1000 [01:12<00:00, 13.8 page/s]
Parsing: 100%|███████████████| 1000/1000 [00:04<00:00, 231.4 page/s]
Done! Scraped 1000 books in 76.3 seconds.
Average: 76ms per book
Compare that with synchronous scraping at 1.5 s/page: 1,500 seconds vs 76 seconds, roughly a 20x speedup.
Async vs Sync: Benchmark Comparison
Here's a real benchmark scraping 100 pages:
| Method | Total Time | Memory Usage | CPU Usage |
|---|---|---|---|
| Synchronous (requests) | ~150 sec | ~45 MB | ~5% |
| Threading (50 threads) | ~18 sec | ~430 MB | ~35% |
| Asyncio (semaphore=15) | ~12 sec | ~52 MB | ~8% |
| Asyncio (semaphore=50) | ~6 sec | ~58 MB | ~12% |
Asyncio wins on both speed and memory efficiency.
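As a sanity check on numbers like these, a back-of-envelope model helps: if every request takes about the same time and at most c run at once, the job needs roughly ceil(n / c) rounds of one latency each. The helper below is a simplified estimate under exactly those assumptions (fixed latency, no retries, no politeness delays, no parsing time), and estimate_runtime is a name invented for this sketch, so measured times will differ.

```python
import math


def estimate_runtime(pages: int, concurrency: int, latency_s: float) -> float:
    """Idealized runtime: requests proceed in rounds of `concurrency` at a time."""
    if pages < 0 or concurrency < 1:
        raise ValueError("pages must be >= 0 and concurrency >= 1")
    rounds = math.ceil(pages / concurrency)
    return rounds * latency_s


# Assumed latency of 1.5 s per page, matching the synchronous figure in this guide.
for c in (1, 15, 50):
    print(f"concurrency={c:>2}: ~{estimate_runtime(100, c, 1.5):.1f}s for 100 pages")
```

With a 1.5 s latency the model gives 150 s for the synchronous case and about 10.5 s at a concurrency of 15, in the same ballpark as the table. The larger gap at a concurrency of 50 suggests that per-request delays and server-side limits, which the model ignores, start to dominate.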
Combining Async Scraping with a Proxy Pool
For sites that rate-limit by IP, rotate proxies asynchronously:
import itertools
import httpx

# HEADERS and TIMEOUT are the constants from the complete scraper above
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

async def fetch_with_proxy(url: str) -> tuple[str, str | None]:
    """Fetch a URL through the next proxy in the rotation."""
    proxy = next(proxy_cycle)
    # The proxy= argument needs a recent httpx release (older versions used proxies=)
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS) as client:
        try:
            r = await client.get(url, timeout=TIMEOUT)
            r.raise_for_status()
            return url, r.text
        except httpx.HTTPError:
            return url, None
When NOT to Use Async Scraping
Async scraping is powerful but not always the right tool:
JavaScript-rendered pages: httpx cannot execute JavaScript. Use Playwright (async mode) for those.
Sites with strict per-IP rate limits: More concurrency won't help if you're already at the IP limit. Use proxies instead.
Very small jobs (< 50 pages): The overhead of setting up async isn't worth it. Stick with synchronous requests.
CPU-heavy parsing per page: If each page takes 200ms to parse, the bottleneck is CPU, not I/O. Use multiprocessing instead.
Summary
| Concept | What you learned |
|---|---|
| Why async | I/O-bound tasks benefit enormously from concurrency |
| httpx | requests-like HTTP client with a native async API |
| asyncio.gather() | Run multiple coroutines concurrently |
| Semaphore | Cap concurrency to avoid bans |
| Exponential backoff | Gracefully handle rate limits and transient errors |
| Thread pool | Offload CPU-bound parsing without blocking the event loop |
| aiofiles | Safe async file writes |
| Benchmarks | 10–20x speedup over synchronous scraping |