Scraping Wikipedia & Open Government Data with Python (2026 Complete Guide)
Learn how to scrape Wikipedia articles, infoboxes, and tables with Python in 2026 — plus access 100,000+ datasets from India's data.gov.in and the US data.gov APIs. Full working code.
Senior Developer

Introduction: The Cleanest Data on the Internet
Every blog in this series so far has dealt with the hard problem: sites that don't want you scraping them. Anti-bot systems, aggressive rate limiting, JavaScript-rendered content, accounts that get banned, legal grey areas.
Wikipedia and open government data platforms are the complete opposite. They're designed to be scraped. They want you to use their data. They provide official APIs, detailed documentation, and explicit licensing that allows free commercial and non-commercial use. Wikipedia's content is published under Creative Commons Attribution-ShareAlike. The Indian government's data.gov.in platform explicitly invites developers to build on its 100,000+ APIs. The US data.gov provides open access to hundreds of thousands of federal datasets.
This is the most beginner-friendly blog in the entire series — but the data available through these sources is extraordinarily powerful. Wikipedia alone contains structured information on over 6.7 million topics in English, with infoboxes full of structured data, tables full of statistics, and cross-linked articles ideal for knowledge graph construction.
By the end of this guide you'll be able to:
Use the Wikipedia API and the wikipedia-api library to extract articles, summaries, and links
Parse Wikipedia infoboxes and tables into clean pandas DataFrames
Build a multi-article knowledge scraper with link following
Access India's data.gov.in platform and its 100,000+ government datasets
Download and analyse US federal datasets from data.gov
Combine Wikipedia and government data into a unified research dataset
Part 1: Wikipedia — Four Ways to Access Data
Wikipedia offers several access methods, each suited to different use cases:
| Method | Best for | Rate limit | Effort |
|---|---|---|---|
| wikipedia-api library | Article text, summaries, links | Generous | Very Low |
| MediaWiki REST API | Any structured query | Generous | Low |
| | Infoboxes, structured data | Generous | Low |
| Direct HTML scraping | Tables, custom extraction | Generous | Medium |
| Database dumps | Bulk data (millions of articles) | None | High |
Wikipedia's robots.txt welcomes friendly, low-speed bots on article pages. The main requests: use the API rather than raw HTML scraping wherever possible, and send a descriptive User-Agent that includes contact information.
Method 1: The wikipedia-api Library (Easiest)
Extracting well-organized data from Wikipedia can simplify and speed up research, NLP training, or content analysis. With the wikipedia-api library and pandas, a few lines of code turn encyclopedia content into usable data.
pip install wikipedia-api pandas
import wikipediaapi

# Create a Wikipedia API client
# Always include a descriptive User-Agent with contact info
wiki = wikipediaapi.Wikipedia(
    language="en",
    user_agent="ResearchBot/1.0 (contact@yourdomain.com; educational use)"
)

def get_article(title: str) -> dict | None:
    """
    Fetch a Wikipedia article by title.
    Returns structured data including summary, full text, links, and sections.
    """
    page = wiki.page(title)
    if not page.exists():
        print(f"Article not found: '{title}'")
        return None
    return {
        "title": page.title,
        "url": page.fullurl,
        "summary": page.summary,               # First 2–3 paragraphs
        "full_text": page.text,                # Complete article text
        "word_count": len(page.text.split()),
        "sections": [s.title for s in page.sections],
        "links": list(page.links.keys())[:50], # First 50 linked articles
        "categories": list(page.categories.keys())[:20],
        "language": page.language,
    }

# Single article
article = get_article("Machine Learning")
if article:
    print(f"Title: {article['title']}")
    print(f"Words: {article['word_count']:,}")
    print(f"Sections: {', '.join(article['sections'][:5])}")
    print(f"Summary: {article['summary'][:300]}...")
    print(f"Links to: {', '.join(article['links'][:8])}")
Batch fetching multiple articles
import time
import pandas as pd

def batch_fetch_articles(titles: list[str], delay: float = 0.5) -> pd.DataFrame:
    """
    Fetch multiple Wikipedia articles and return as a DataFrame.
    Includes polite delay between requests.
    """
    records = []
    for i, title in enumerate(titles):
        print(f"[{i+1}/{len(titles)}] Fetching: {title}")
        article = get_article(title)
        if article:
            records.append({
                "title": article["title"],
                "url": article["url"],
                "word_count": article["word_count"],
                "sections": len(article["sections"]),
                "links": len(article["links"]),
                "summary": article["summary"][:500],
            })
        time.sleep(delay)  # Wikipedia asks for polite delays
    return pd.DataFrame(records)

# Fetch articles on Python scraping ecosystem
topics = [
    "Web scraping", "BeautifulSoup", "Scrapy (software)",
    "Playwright (software)", "Selenium (software)",
    "XPath", "CSS", "HTML", "JSON", "Python (programming language)"
]
df = batch_fetch_articles(topics)
df.to_csv("wikipedia_articles.csv", index=False)
print(df[["title", "word_count", "sections", "links"]])
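The fixed delay in batch_fetch_articles is fine for a handful of pages. For longer runs, a common refinement is to wait progressively longer after a failed request. The sketch below only computes the wait times; backoff_delays and its parameters are hypothetical names for illustration, not part of wikipedia-api.

```python
def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   max_delay: float = 30.0, attempts: int = 5) -> list[float]:
    """Return the wait (in seconds) before each retry, capped at max_delay."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= factor  # Grow the wait after every failed attempt
    return delays

print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

Each value can be passed to time.sleep() before the next retry; the cap keeps a long outage from producing very long waits.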
Method 2: The MediaWiki REST API (Most Powerful)
The MediaWiki API is Wikipedia's official programmatic interface. It gives far more control than the wikipedia-api library: you can run full-text searches, get page metadata, fetch revision history, and obtain the Wikidata ID for a page. The code below uses both the Action API (w/api.php) and the REST summary endpoint (api/rest_v1).
import asyncio
import re
from urllib.parse import quote

import httpx

WIKI_API = "https://en.wikipedia.org/w/api.php"
HEADERS = {
    "User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"
}

async def wiki_api_request(params: dict) -> dict:
    """Make a request to the MediaWiki Action API."""
    params = {**params, "format": "json"}  # Copy so the caller's dict is not mutated
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(WIKI_API, params=params, timeout=15)
        r.raise_for_status()
        return r.json()

async def search_wikipedia(query: str, limit: int = 10) -> list[dict]:
    """Full-text search across Wikipedia."""
    data = await wiki_api_request({
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "srprop": "snippet|titlesnippet|size|wordcount",
    })
    results = []
    for item in data.get("query", {}).get("search", []):
        # Strip HTML tags from snippet
        snippet = re.sub(r"<[^>]+>", "", item.get("snippet", ""))
        results.append({
            "title": item["title"],
            "snippet": snippet,
            "word_count": item.get("wordcount", 0),
            "size_bytes": item.get("size", 0),
            "url": f"https://en.wikipedia.org/wiki/{item['title'].replace(' ', '_')}",
        })
    return results

async def get_page_summary(title: str) -> dict:
    """Get a concise summary using Wikipedia's REST summary endpoint."""
    # Percent-encode the title so characters such as "/" or "?" stay inside the path
    slug = quote(title.replace(" ", "_"), safe="")
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{slug}"
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(url, timeout=10)
        if r.status_code == 404:
            return {}
        r.raise_for_status()
        data = r.json()
    return {
        "title": data.get("title"),
        "description": data.get("description"),
        "extract": data.get("extract"),            # Plain text summary
        "thumbnail": data.get("thumbnail", {}).get("source"),
        "url": data.get("content_urls", {}).get("desktop", {}).get("page"),
        "wikidata_id": data.get("wikibase_item"),  # For Wikidata queries
    }

async def get_page_links(title: str, limit: int = 100) -> list[str]:
    """Get up to `limit` internal links from a Wikipedia page (first batch only)."""
    data = await wiki_api_request({
        "action": "query",
        "titles": title,
        "prop": "links",
        "pllimit": limit,
        "plnamespace": 0,  # Only article namespace (not talk pages etc.)
    })
    pages = data.get("query", {}).get("pages", {})
    links = []
    for page_data in pages.values():
        for link in page_data.get("links", []):
            links.append(link["title"])
    return links

async def get_page_images(title: str) -> list[dict]:
    """Get the images used in a Wikipedia article (first 50)."""
    data = await wiki_api_request({
        "action": "query",
        "titles": title,
        "prop": "images",
        "imlimit": 50,
    })
    pages = data.get("query", {}).get("pages", {})
    images = []
    for page_data in pages.values():
        for img in page_data.get("images", []):
            name = img["title"].replace("File:", "")
            if name.lower().endswith((".jpg", ".png", ".svg", ".gif")):
                images.append({
                    "filename": name,
                    "wiki_url": f"https://commons.wikimedia.org/wiki/File:{name.replace(' ', '_')}",
                })
    return images

async def main_api():
    # Search
    print("── Search results for 'python web scraping' ──")
    results = await search_wikipedia("python web scraping", limit=5)
    for r in results:
        print(f"  {r['title']}: {r['snippet'][:80]}...")
    # Summary
    print("\n── Page summary ──")
    summary = await get_page_summary("Web scraping")
    if summary:  # Empty dict means the page was not found
        print(f"  {summary['title']}: {(summary['extract'] or '')[:200]}...")
    # Links
    print("\n── Top links from 'Web scraping' article ──")
    links = await get_page_links("Web scraping", limit=20)
    print(f"  {', '.join(links[:10])}")

asyncio.run(main_api())
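A single Action API request returns only one batch of links. When more results exist, the response includes a `continue` object whose fields are merged into the next request. The sketch below shows that loop; `fetch` is a hypothetical callable standing in for the real HTTP request, and `fake_fetch` is made-up data so the example runs offline.

```python
from typing import Callable, Iterator

def iter_link_titles(fetch: Callable[[dict], dict], title: str) -> Iterator[str]:
    """Yield every link title for a page, following 'continue' tokens."""
    params = {"action": "query", "titles": title, "prop": "links",
              "pllimit": "max", "plnamespace": 0, "format": "json"}
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("links", []):
                yield link["title"]
        if "continue" not in data:
            break  # No more batches
        # Merge the continuation fields into the next request
        params = {**params, **data["continue"]}

def fake_fetch(params: dict) -> dict:
    """Made-up two-batch response, used only to exercise the loop."""
    if "plcontinue" not in params:
        return {"continue": {"plcontinue": "123|0|B", "continue": "||"},
                "query": {"pages": {"123": {"links": [{"title": "A"}]}}}}
    return {"query": {"pages": {"123": {"links": [{"title": "B"}, {"title": "C"}]}}}}

print(list(iter_link_titles(fake_fetch, "Web scraping")))  # ['A', 'B', 'C']
```

In real use, `fetch` would wrap an HTTP GET against w/api.php and return the decoded JSON, with a polite delay between batches.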
Method 3: Extracting Wikipedia Infoboxes
Infoboxes are the structured data panels on the right side of Wikipedia articles — they contain the most machine-readable data on Wikipedia: population, area, GDP, founding date, CEO, headquarters, etc.
You can use BeautifulSoup to find and parse infobox content, transforming unstructured web content into structured datasets useful for NLP training, trend analysis, or data journalism.
import io
import re
import time

import httpx
import pandas as pd
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"}

def fetch_wikipedia_html(title: str) -> str:
    """Fetch the raw HTML of a Wikipedia page."""
    url = f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}"
    r = httpx.get(url, headers=HEADERS, follow_redirects=True, timeout=15)
    r.raise_for_status()
    return r.text

def parse_infobox(html: str) -> dict:
    """
    Extract key-value pairs from a Wikipedia infobox.
    Works with most infobox types: country, person, company, city, film, etc.
    """
    soup = BeautifulSoup(html, "lxml")
    infobox = soup.select_one("table.infobox")
    if not infobox:
        return {}
    data = {}
    for row in infobox.select("tr"):
        # Standard two-column layout: th = label, td = value
        label_el = row.select_one("th")
        value_el = row.select_one("td")
        if label_el and value_el:
            label = label_el.get_text(separator=" ", strip=True)
            # Strip footnote numbers like [1], [2]
            value = re.sub(r"\[\d+\]", "", value_el.get_text(separator=" ", strip=True))
            value = re.sub(r"\s+", " ", value).strip()
            if label and value and len(label) < 80:
                data[label] = value
    return data

def parse_all_tables(html: str) -> list[pd.DataFrame]:
    """
    Extract all HTML tables from a Wikipedia page as DataFrames.
    Great for statistical tables, comparison tables, rankings.
    """
    try:
        # Wrap the HTML in StringIO: recent pandas versions expect a file-like object
        return pd.read_html(io.StringIO(html), flavor="lxml")
    except Exception:
        # read_html raises (for example ValueError) when no tables are found
        return []

# Example: Extract country data from Wikipedia infoboxes
countries = ["India", "China", "United_States", "Brazil", "Germany"]
country_data = []
for country in countries:
    print(f"Fetching: {country}...")
    html = fetch_wikipedia_html(country)
    infobox = parse_infobox(html)
    tables = parse_all_tables(html)
    # Standardise common infobox fields
    country_data.append({
        "country": country,
        "capital": infobox.get("Capital") or infobox.get("Capital city"),
        "population": infobox.get("Population"),
        "area": infobox.get("Area"),
        "gdp_nominal": infobox.get("GDP (nominal)") or infobox.get("GDP (PPP)"),
        "currency": infobox.get("Currency"),
        "language": infobox.get("Official languages"),
        "government": infobox.get("Government"),
        "tables_found": len(tables),
    })
    time.sleep(0.5)

df = pd.DataFrame(country_data)
print(df.to_string(index=False))
df.to_csv("country_infoboxes.csv", index=False)
Extracting specific tables (rankings, statistics)
def get_wikipedia_table(title: str, table_index: int = 0, match_text: str | None = None) -> pd.DataFrame | None:
    """
    Extract a specific table from a Wikipedia page.
    Args:
        title: Wikipedia article title
        table_index: Which table to return (0 = first)
        match_text: Return the first table whose header contains this text
    """
    html = fetch_wikipedia_html(title)
    tables = parse_all_tables(html)
    if not tables:
        print(f"No tables found on '{title}'")
        return None
    if match_text:
        for t in tables:
            # Check if any column header contains match_text
            if any(match_text.lower() in str(col).lower() for col in t.columns):
                return t
        print(f"No table with '{match_text}' found")
        return None
    if table_index >= len(tables):
        print(f"Table index {table_index} out of range (found {len(tables)} tables)")
        return None
    return tables[table_index]

# Get the list of largest cities by population
cities_df = get_wikipedia_table(
    "List of largest cities",
    match_text="Population"
)
if cities_df is not None:
    print(f"Found table with {len(cities_df)} rows")
    print(cities_df.head(10))
    cities_df.to_csv("largest_cities.csv", index=False)

# Get Nobel Prize winners table
nobel_df = get_wikipedia_table("List of Nobel laureates", table_index=0)
if nobel_df is not None:
    print(f"\nNobel laureates table: {len(nobel_df)} entries")
    print(nobel_df.head(5))
Method 4: Building a Wikipedia Knowledge Graph Scraper
Wikipedia's internal links create a natural knowledge graph. Following links allows you to scrape entire topic clusters:
import asyncio
import json
from collections import deque
from urllib.parse import quote

import httpx
import pandas as pd

HEADERS = {"User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"}

async def build_knowledge_graph(
    seed_topic: str,
    max_articles: int = 50,
    max_depth: int = 2,
    language: str = "en"
) -> dict:
    """
    Build a knowledge graph by crawling Wikipedia starting from a seed topic.
    Follows internal links up to max_depth levels.
    Returns: {
        "nodes": [{title, description, summary, url, depth}],
        "edges": [(source_title, target_title)]
    }
    """
    base = f"https://{language}.wikipedia.org"  # Use the requested language edition
    nodes = {}
    edges = []
    queue = deque([(seed_topic, 0)])
    visited = set()
    async with httpx.AsyncClient(headers=HEADERS) as client:
        while queue and len(nodes) < max_articles:
            title, depth = queue.popleft()
            if title in visited or depth > max_depth:
                continue
            visited.add(title)
            print(f"  [depth={depth}] {title} ({len(nodes)}/{max_articles})")
            # Fetch summary
            try:
                r = await client.get(
                    f"{base}/api/rest_v1/page/summary/"
                    f"{quote(title.replace(' ', '_'), safe='')}",
                    timeout=10
                )
                if r.status_code != 200:
                    continue
                data = r.json()
            except Exception:
                continue
            nodes[title] = {
                "title": data.get("title", title),
                "description": data.get("description", ""),
                "summary": (data.get("extract") or "")[:300],
                "url": data.get("content_urls", {})
                           .get("desktop", {}).get("page", ""),
                "depth": depth,
            }
            # Fetch links if not at max depth
            if depth < max_depth:
                try:
                    links_r = await client.get(
                        f"{base}/w/api.php",
                        params={
                            "action": "query", "titles": title,
                            "prop": "links", "pllimit": 20,
                            "plnamespace": 0, "format": "json"
                        },
                        timeout=10
                    )
                    links_r.raise_for_status()
                    links_data = links_r.json()
                except Exception:
                    links_data = {}  # Keep the node even if its links cannot be fetched
                pages = links_data.get("query", {}).get("pages", {})
                for page_data in pages.values():
                    for link in page_data.get("links", []):
                        link_title = link["title"]
                        edges.append((title, link_title))
                        if link_title not in visited:
                            queue.append((link_title, depth + 1))
            await asyncio.sleep(0.3)  # Polite delay
    return {
        "seed": seed_topic,
        "nodes": list(nodes.values()),
        "edges": edges[:500],  # Limit edges for manageability
        "stats": {
            "total_articles": len(nodes),
            "total_edges": len(edges),
            "max_depth": max_depth,
        }
    }

async def main_graph():
    print("Building knowledge graph for 'Machine Learning'...")
    graph = await build_knowledge_graph(
        seed_topic="Machine learning",
        max_articles=30,
        max_depth=2
    )
    print("\nGraph built:")
    print(f"  Articles: {graph['stats']['total_articles']}")
    print(f"  Connections: {graph['stats']['total_edges']}")
    # Save as JSON for use in graph tools (NetworkX, Gephi, D3.js)
    with open("knowledge_graph.json", "w", encoding="utf-8") as f:
        json.dump(graph, f, indent=2, ensure_ascii=False)
    # Save nodes as CSV for analysis
    nodes_df = pd.DataFrame(graph["nodes"])
    nodes_df.to_csv("knowledge_graph_nodes.csv", index=False)
    print("\nSaved to knowledge_graph.json and knowledge_graph_nodes.csv")

asyncio.run(main_graph())
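The crawl order is easier to check on a small in-memory link map than against the live site. The sketch below runs the same breadth-first, depth-limited traversal over a hypothetical LINKS dictionary; the titles and links in it are made up for illustration.

```python
from collections import deque

# Hypothetical link map standing in for Wikipedia's link API
LINKS = {
    "Machine learning": ["Statistics", "Neural network"],
    "Statistics": ["Probability", "Machine learning"],
    "Neural network": ["Deep learning"],
    "Deep learning": ["GPU"],
}

def crawl(seed: str, max_depth: int = 2, max_articles: int = 50) -> dict[str, int]:
    """Return {title: depth} for pages reachable within max_depth links."""
    depths: dict[str, int] = {}
    queue = deque([(seed, 0)])
    while queue and len(depths) < max_articles:
        title, depth = queue.popleft()
        if title in depths or depth > max_depth:
            continue  # Already visited, or too far from the seed
        depths[title] = depth
        for target in LINKS.get(title, []):
            if target not in depths:
                queue.append((target, depth + 1))
    return depths

print(crawl("Machine learning"))
```

"GPU" is three links from the seed, so it is dropped at max_depth=2, and the back-link from "Statistics" to the seed is skipped because the seed is already visited.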
Part 2: Wikidata — Structured Data from Wikipedia
Wikidata is the structured data backbone behind Wikipedia — a giant knowledge base of facts in machine-readable form. Every Wikipedia article links to a Wikidata entity with typed properties.
import asyncio

import httpx
import pandas as pd

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
HEADERS = {
    "User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)",
    "Accept": "application/sparql-results+json"
}

async def wikidata_query(sparql: str) -> list[dict]:
    """
    Execute a SPARQL query against Wikidata.
    Returns a list of result dicts.
    """
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(
            WIKIDATA_SPARQL,
            params={"query": sparql, "format": "json"},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()
    bindings = data.get("results", {}).get("bindings", [])
    results = []
    for row in bindings:
        results.append({
            key: val.get("value") for key, val in row.items()
        })
    return results

async def get_country_data() -> pd.DataFrame:
    """Get population and GDP data for the 50 most populous countries from Wikidata."""
    sparql = """
    SELECT ?country ?countryLabel ?population ?gdp ?capital ?capitalLabel WHERE {
      ?country wdt:P31 wd:Q3624078.   # Instance of: sovereign state
      OPTIONAL { ?country wdt:P1082 ?population. }
      OPTIONAL { ?country wdt:P2131 ?gdp. }
      OPTIONAL { ?country wdt:P36 ?capital. }
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en".
      }
    }
    ORDER BY DESC(?population)
    LIMIT 50
    """
    results = await wikidata_query(sparql)
    df = pd.DataFrame(results)
    return df

async def get_tech_companies() -> pd.DataFrame:
    """Get major tech companies with founding year and HQ from Wikidata."""
    sparql = """
    SELECT ?company ?companyLabel ?founded ?hq ?hqLabel ?employees WHERE {
      ?company wdt:P31 wd:Q4830453.   # Business enterprise
      ?company wdt:P452 wd:Q11032.    # Industry (P452); confirm on wikidata.org that this QID is the software industry item
      OPTIONAL { ?company wdt:P571 ?founded. }
      OPTIONAL { ?company wdt:P159 ?hq. }
      OPTIONAL { ?company wdt:P1128 ?employees. }
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en".
      }
      FILTER(BOUND(?employees) && ?employees > 1000)
    }
    ORDER BY DESC(?employees)
    LIMIT 30
    """
    results = await wikidata_query(sparql)
    return pd.DataFrame(results)

async def main_wikidata():
    print("── Top 20 countries by population (Wikidata) ──")
    countries = await get_country_data()
    print(countries[["countryLabel", "population", "gdp"]].head(20))
    countries.to_csv("wikidata_countries.csv", index=False)
    print("\n── Major tech companies (Wikidata) ──")
    companies = await get_tech_companies()
    print(companies[["companyLabel", "founded", "hqLabel", "employees"]].head(15))
    companies.to_csv("wikidata_tech_companies.csv", index=False)

asyncio.run(main_wikidata())
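The query helper above keeps every value as a string, including numbers and full entity URIs. In the SPARQL JSON results format each cell also carries a "type" and, for typed literals, a "datatype", which can be used to convert values. A minimal sketch, where simplify_binding is a hypothetical helper and the sample row is illustrative rather than live data:

```python
NUMERIC_TYPES = {
    "http://www.w3.org/2001/XMLSchema#decimal",
    "http://www.w3.org/2001/XMLSchema#integer",
    "http://www.w3.org/2001/XMLSchema#double",
}
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def simplify_binding(row: dict) -> dict:
    """Convert one SPARQL JSON binding into plain Python values."""
    out = {}
    for key, cell in row.items():
        value = cell.get("value")
        if cell.get("datatype") in NUMERIC_TYPES:
            out[key] = float(value)
        elif cell.get("type") == "uri" and value.startswith(ENTITY_PREFIX):
            out[key] = value[len(ENTITY_PREFIX):]  # Keep just the QID
        else:
            out[key] = value
    return out

# Illustrative row in the shape the endpoint returns (values are sample data)
sample = {
    "country": {"type": "uri", "value": "http://www.wikidata.org/entity/Q668"},
    "countryLabel": {"type": "literal", "xml:lang": "en", "value": "India"},
    "population": {"type": "literal",
                   "datatype": "http://www.w3.org/2001/XMLSchema#decimal",
                   "value": "1326093247"},
}
print(simplify_binding(sample))
```

With numeric columns converted, sorting and arithmetic in pandas behave as expected instead of comparing strings.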
Part 3: India's Open Government Data — data.gov.in
The datagovindia library is a wrapper around 100,000+ APIs of the Government of India's open data platform data.gov.in. Its functionality centres around three aspects: API discovery — finding the right API from all available APIs; API information — getting details about a particular API; and querying the API — getting a tidy dataset from the chosen API.
This is an extraordinary resource for Indian developers and researchers — census data, agricultural statistics, health data, economic indicators, transport data — all free, all official, all available via a consistent API.
pip install datagovindia pandas
# Get your free API key at: data.gov.in/user/register
from datagovindia import DataGovIndia
import pandas as pd

# NOTE: the method names used below (search_data, get_data_info, get_data) are
# kept as written in this post. The datagovindia API has changed between
# releases, so check them against the version you install.

# Initialise with your API key
dgi = DataGovIndia(api_key="YOUR_DATA_GOV_IN_API_KEY")

# ── Discovery: Find relevant datasets ──────────────────────────
def search_datasets(keyword: str, limit: int = 10) -> pd.DataFrame:
    """Search for government datasets by keyword."""
    results = dgi.search_data(keyword, results=limit)
    return pd.DataFrame(results)

# Search for datasets
education_datasets = search_datasets("school enrollment india")
print("── Education datasets on data.gov.in ──")
print(education_datasets[["title", "org_type", "source"]].head(10))
agriculture_datasets = search_datasets("crop production india state")
print("\n── Agriculture datasets ──")
print(agriculture_datasets[["title", "source"]].head(5))

# ── Get dataset info ──────────────────────────────────────────
def get_dataset_info(index_name: str) -> dict:
    """Get metadata about a specific dataset."""
    return dgi.get_data_info(index_name)

# ── Download a dataset ────────────────────────────────────────
def download_dataset(index_name: str, limit: int = 1000) -> pd.DataFrame:
    """Download records from a government dataset."""
    data = dgi.get_data(index_name, results=limit)
    return pd.DataFrame(data)

# Example datasets (index_names from data.gov.in search):
# State-wise school enrollment data
SCHOOL_DATASET = "your-dataset-index-name-from-search"
try:
    df = download_dataset(SCHOOL_DATASET)
    print(f"\nDownloaded {len(df)} records")
    print(df.head())
    df.to_csv("india_school_enrollment.csv", index=False)
except Exception as e:
    print(f"Dataset access error: {e}")
Direct API access (no wrapper library)
The Open Government Data Platform India offers a collection of APIs that provide access to open datasets. Users can integrate these APIs into their applications to retrieve and utilize public data, enhancing data-driven solutions and innovations.
import asyncio

import httpx
import pandas as pd

DATA_GOV_BASE = "https://api.data.gov.in/resource"
API_KEY = "YOUR_DATA_GOV_IN_API_KEY"

async def fetch_gov_dataset(
    resource_id: str,
    limit: int = 100,
    offset: int = 0,
    filters: dict | None = None
) -> dict:
    """
    Fetch records from a data.gov.in dataset.
    Args:
        resource_id: Dataset UUID from data.gov.in (visible in URL)
        limit: Records per request (max 100)
        offset: Pagination offset
        filters: Dict of field:value filters, e.g. {"State": "Maharashtra"}
    """
    params = {
        "api-key": API_KEY,
        "format": "json",
        "limit": limit,
        "offset": offset,
    }
    if filters:
        for field, value in filters.items():
            params[f"filters[{field}]"] = value
    url = f"{DATA_GOV_BASE}/{resource_id}"
    async with httpx.AsyncClient() as client:
        r = await client.get(url, params=params, timeout=20)
        r.raise_for_status()
        return r.json()

async def download_full_dataset(resource_id: str, max_records: int = 5000) -> pd.DataFrame:
    """
    Download a dataset by paginating through its records, up to max_records.
    """
    all_records = []
    offset = 0
    page_size = 100
    while len(all_records) < max_records:
        data = await fetch_gov_dataset(resource_id, limit=page_size, offset=offset)
        records = data.get("records", [])
        if not records:
            break
        all_records.extend(records)
        total = int(data.get("total", 0))
        print(f"  Downloaded {len(all_records)}/{min(total, max_records)} records...")
        if len(all_records) >= total or len(all_records) >= max_records:
            break
        offset += page_size
        await asyncio.sleep(0.5)  # Polite delay
    df = pd.DataFrame(all_records[:max_records])  # Trim any overshoot from the last page
    print(f"  Total: {len(df)} records in {len(df.columns)} columns")
    return df

# Resource IDs can be found on data.gov.in dataset pages
# Format: https://api.data.gov.in/resource/RESOURCE-UUID?api-key=KEY&format=json
async def main_datagov():
    # Example resource ID: replace it with the UUID of the dataset you want
    CENSUS_RESOURCE_ID = "9ef84268-d588-465a-a308-a864a43d0070"
    print("Downloading India census data...")
    df = await download_full_dataset(CENSUS_RESOURCE_ID, max_records=500)
    if not df.empty:
        print(f"\nDataset shape: {df.shape}")
        print(f"Columns: {list(df.columns)}")
        print(df.head())
        df.to_csv("india_census_data.csv", index=False)

asyncio.run(main_datagov())
Part 4: US Federal Data — data.gov
The US government's data.gov platform hosts hundreds of thousands of datasets from every federal agency — health, transportation, energy, agriculture, economy, education. All free, all open.
import asyncio
import io

import httpx
import pandas as pd

DATAGOV_API = "https://catalog.data.gov/api/3/action"

async def search_us_datasets(
    query: str,
    limit: int = 10,
    organization: str | None = None
) -> pd.DataFrame:
    """
    Search the US data.gov catalog.
    Args:
        query: Search terms
        limit: Number of results
        organization: Filter by agency, e.g. "census-bureau", "cdc-gov"
    """
    params = {
        "q": query,
        "rows": limit,
        "sort": "score desc",
    }
    if organization:
        params["fq"] = f"organization:{organization}"
    async with httpx.AsyncClient() as client:
        r = await client.get(
            f"{DATAGOV_API}/package_search",
            params=params,
            timeout=15
        )
        r.raise_for_status()
        data = r.json()
    datasets = []
    for item in data.get("result", {}).get("results", []):
        datasets.append({
            "name": item.get("name"),
            "title": item.get("title"),
            # "organization" and "tracking_summary" can be null, hence the "or {}"
            "organization": (item.get("organization") or {}).get("title"),
            "description": (item.get("notes") or "")[:200],
            "formats": [res.get("format") for res in item.get("resources", [])],
            "url": f"https://catalog.data.gov/dataset/{item.get('name')}",
            "downloads": (item.get("tracking_summary") or {}).get("total", 0),
        })
    return pd.DataFrame(datasets)

async def download_datagov_csv(resource_url: str) -> pd.DataFrame:
    """Download a CSV dataset directly from data.gov."""
    async with httpx.AsyncClient(follow_redirects=True) as client:
        r = await client.get(resource_url, timeout=60)
        r.raise_for_status()
    df = pd.read_csv(io.StringIO(r.text))
    print(f"Downloaded: {df.shape[0]} rows × {df.shape[1]} columns")
    return df

async def get_dataset_resources(dataset_name: str) -> list[dict]:
    """Get all downloadable resources for a dataset."""
    async with httpx.AsyncClient() as client:
        r = await client.get(
            f"{DATAGOV_API}/package_show",
            params={"id": dataset_name},
            timeout=15
        )
        r.raise_for_status()
        data = r.json()
    resources = []
    for res in data.get("result", {}).get("resources", []):
        resources.append({
            "name": res.get("name"),
            "format": res.get("format"),
            "url": res.get("url"),
            "size": res.get("size"),
        })
    return resources

async def main_datagov_us():
    # Search for health datasets
    print("── Searching data.gov for COVID datasets ──")
    health_df = await search_us_datasets("COVID vaccination rates state", limit=5)
    print(health_df[["title", "organization", "formats"]].to_string(index=False))
    # Search for economic datasets
    print("\n── Economic datasets from Census Bureau ──")
    econ_df = await search_us_datasets(
        "employment statistics",
        limit=5,
        organization="census-bureau"
    )
    print(econ_df[["title", "description"]].to_string(index=False))

asyncio.run(main_datagov_us())
Part 5: Combining Wikipedia + Government Data
The real power comes from joining multiple open sources. Here's an example combining Wikipedia infobox data with official government statistics:
import asyncio
import pandas as pd

# fetch_wikipedia_html and parse_infobox are the helpers defined under Method 3

async def build_india_states_dataset() -> pd.DataFrame:
    """
    Build a dataset of Indian states by combining:
    1. Wikipedia infoboxes (area, founded, capital)
    2. data.gov.in (official population, literacy, GDP): left as a
       placeholder below until you have chosen a resource ID
    """
    INDIAN_STATES = [
        "Maharashtra", "Uttar Pradesh", "Karnataka", "Tamil Nadu",
        "Gujarat", "Rajasthan", "West Bengal", "Andhra Pradesh",
        "Telangana", "Kerala", "Madhya Pradesh", "Bihar"
    ]
    # ── Wikipedia data ────────────────────────────────────────
    print("Fetching Wikipedia infobox data...")
    wiki_records = []
    for state in INDIAN_STATES:
        # Article titles are the plain state names (no "_state" suffix)
        html = fetch_wikipedia_html(state)
        infobox = parse_infobox(html)
        wiki_records.append({
            "state": state,
            "capital": infobox.get("Capital") or infobox.get("Capital city"),
            "area_km2": infobox.get("Area"),
            "founded": infobox.get("Formation") or infobox.get("Established"),
            "districts": infobox.get("Districts"),
            "wiki_url": f"https://en.wikipedia.org/wiki/{state.replace(' ', '_')}",
        })
        await asyncio.sleep(0.5)
    wiki_df = pd.DataFrame(wiki_records)
    # ── Merge ─────────────────────────────────────────────────
    # Placeholder: join a data.gov.in DataFrame on "state" here.
    # Until then, the Wikipedia data is saved on its own.
    final_df = wiki_df
    final_df.to_csv("india_states_combined.csv", index=False)
    print(f"\nBuilt dataset: {len(final_df)} states × {len(final_df.columns)} columns")
    print(final_df[["state", "capital", "area_km2"]].to_string(index=False))
    return final_df

asyncio.run(build_india_states_dataset())
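The join itself mostly comes down to matching state names that are spelled or capitalised differently in each source. The sketch below shows one way to do that with plain dictionaries so it runs without any downloads; normalise and join_on_state are hypothetical helpers, and the government row is made-up placeholder data, not a real statistic.

```python
def normalise(name: str) -> str:
    """Lower-case, unify '&' and 'and', and collapse repeated whitespace."""
    return " ".join(name.lower().replace("&", "and").split())

def join_on_state(wiki_rows: list[dict], gov_rows: list[dict]) -> list[dict]:
    """Left-join government rows onto Wikipedia rows by normalised state name."""
    gov_by_state = {normalise(r["state"]): r for r in gov_rows}
    merged = []
    for row in wiki_rows:
        extra = gov_by_state.get(normalise(row["state"]), {})
        merged.append({**extra, **row})  # Wikipedia fields win on conflicts
    return merged

wiki_rows = [
    {"state": "Tamil Nadu", "capital": "Chennai"},
    {"state": "Kerala", "capital": "Thiruvananthapuram"},
]
# Placeholder government row (the metric is invented for illustration)
gov_rows = [{"state": "TAMIL  NADU", "example_metric": 1.0}]

for row in join_on_state(wiki_rows, gov_rows):
    print(row)
```

With both sources already in DataFrames, the same result comes from adding a normalised key column to each and calling wiki_df.merge(gov_df, on="key", how="left").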
Wikipedia Database Dumps (For Bulk Access)
For extremely large-scale research (millions of articles), use Wikipedia's official database dumps instead of the API:
# Download the latest English Wikipedia article dump (~22GB compressed)
# Use dumps.wikimedia.org — updated monthly
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# For just abstracts and titles (~1GB) — much more practical
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz
import gzip
import xml.etree.ElementTree as ET
import pandas as pd

def parse_wikipedia_abstracts(dump_file: str, max_articles: int = 10000) -> pd.DataFrame:
    """
    Parse the Wikipedia abstracts dump into a DataFrame.
    Much faster than API calls for large datasets.
    """
    records = []
    with gzip.open(dump_file, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag != "doc":
                continue
            title = elem.findtext("title", "").replace("Wikipedia: ", "")
            url = elem.findtext("url", "")
            abstract = elem.findtext("abstract", "")
            if title and abstract and len(abstract) > 50:
                records.append({
                    "title": title,
                    "url": url,
                    "abstract": abstract[:500],
                    "length": len(abstract),
                })
            elem.clear()  # Free memory
            if len(records) >= max_articles:
                break  # Stop reading instead of scanning the rest of the dump
    df = pd.DataFrame(records)
    print(f"Parsed {len(df):,} articles from dump")
    return df

# df = parse_wikipedia_abstracts("enwiki-latest-abstract.xml.gz", max_articles=100000)
# df.to_parquet("wikipedia_abstracts.parquet")  # Parquet for efficient storage
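Before pointing the parser at a multi-gigabyte file, it can be worth checking the streaming logic against a tiny in-memory sample. The sketch below assumes the abstracts dump is laid out as a feed element containing doc elements, each with title, url, and abstract children; SAMPLE is made-up text in that shape and iter_docs is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET

# Made-up two-record sample in the assumed <feed><doc>...</doc></feed> layout
SAMPLE = b"""<feed>
<doc><title>Wikipedia: Anarchism</title><url>https://en.wikipedia.org/wiki/Anarchism</url><abstract>Sample abstract text.</abstract></doc>
<doc><title>Wikipedia: Albedo</title><url>https://en.wikipedia.org/wiki/Albedo</url><abstract>Another sample abstract.</abstract></doc>
</feed>"""

def iter_docs(stream):
    """Yield one dict per <doc> element without loading the whole file."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "doc":
            yield {
                "title": elem.findtext("title", "").removeprefix("Wikipedia: "),
                "url": elem.findtext("url", ""),
                "abstract": elem.findtext("abstract", ""),
            }
            elem.clear()  # Release the element once it has been read

docs = list(iter_docs(io.BytesIO(SAMPLE)))
print([d["title"] for d in docs])  # ['Anarchism', 'Albedo']
```

The same generator accepts the handle returned by gzip.open(dump_file, "rb"), so the logic checked here carries over to the real file.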
FAQ
Q: Is scraping Wikipedia legal? Wikipedia's content is published under Creative Commons Attribution-ShareAlike 4.0, which allows free reuse, including in commercial applications, provided you attribute the source and share alike. Automated access is acceptable as long as it places minimal load on Wikipedia's servers, does not disrupt the site's operation, and respects robots.txt. Prefer the API over raw HTML scraping, and send a meaningful User-Agent.
Q: How do I find the right data.gov.in resource ID? Go to data.gov.in, search for your topic, click a dataset, and look at the API URL shown on the dataset page. The UUID in the URL is your resource ID.
Q: What is Wikidata and how is it different from Wikipedia? Wikipedia contains articles (human-readable text). Wikidata contains structured facts (machine-readable data) — dates, quantities, relationships, identifiers. Every Wikipedia article links to a Wikidata entity. The SPARQL query interface lets you query across all of Wikidata at once.
Q: Can I use Wikipedia data for commercial AI training? Yes — Wikipedia's CC BY-SA 4.0 licence allows commercial use. You must attribute Wikipedia and share your derivative work under the same licence. Many major LLMs include Wikipedia in their training data.
Q: What's the politest way to scrape Wikipedia at scale? Use the official API. Set a descriptive User-Agent. Add 0.5–1 second delays between requests. For bulk access (millions of articles), use database dumps — they put zero load on Wikipedia's servers.
Summary
| Source | What you get | Method | Licence |
|---|---|---|---|
| Wikipedia API | Article text, summaries, links | wikipedia-api library | CC BY-SA 4.0 |
| Wikipedia HTML | Infoboxes, tables, images | BeautifulSoup | CC BY-SA 4.0 |
| Wikidata SPARQL | Structured facts, relationships | SPARQL query | CC0 (public domain) |
| Wikipedia dumps | Millions of articles | XML parsing | CC BY-SA 4.0 |
| data.gov.in | 100,000+ India govt datasets | datagovindia library | NDSAP (open, free) |
| data.gov (US) | 300,000+ US federal datasets | REST API | Mostly public domain |