Scraping Wikipedia & Open Government Data with Python (2026 Complete Guide)

Wikipedia and government data platforms offer some of the cleanest data on the internet. This guide covers article extraction, infobox parsing, knowledge graphs, Wikidata queries, and large-scale public datasets from India and the United States.

Introduction: The Cleanest Data on the Internet

Every blog in this series so far has dealt with the hard problem: sites that don't want you scraping them. Anti-bot systems, aggressive rate limiting, JavaScript-rendered content, accounts that get banned, legal grey areas.

Wikipedia and open government data platforms are the complete opposite. They're designed to be scraped. They want you to use their data. They provide official APIs, detailed documentation, and explicit licensing that allows free commercial and non-commercial use. Wikipedia's content is published under Creative Commons Attribution-ShareAlike. The Indian government's data.gov.in platform explicitly invites developers to build on its 100,000+ APIs. The US data.gov provides open access to hundreds of thousands of federal datasets.

This is the most beginner-friendly blog in the entire series — but the data available through these sources is extraordinarily powerful. Wikipedia alone contains structured information on over 6.7 million topics in English, with infoboxes full of structured data, tables full of statistics, and cross-linked articles ideal for knowledge graph construction.

By the end of this guide you'll be able to:

Use the Wikipedia API and wikipedia-api library to extract articles, summaries, and links
Parse Wikipedia infoboxes and tables into clean pandas DataFrames
Build a multi-article knowledge scraper with link following
Access India's data.gov.in platform and its 100,000+ government datasets
Download and analyse US federal datasets from data.gov
Combine Wikipedia and government data into a unified research dataset

Part 1: Wikipedia — Four Ways to Access Data

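Wikipedia data can be reached through several routes, each suited to different use cases. The table compares five options; this guide walks through four methods and covers database dumps separately at the end: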

Method	Best for	Rate limit	Effort
`wikipedia-api` library	Article text, summaries, links	Generous	Very Low
MediaWiki REST API	Any structured query	Generous	Low
`wptools` library	Infoboxes, structured data	Generous	Low
Direct HTML scraping	Tables, custom extraction	Generous	Medium
Database dumps	Bulk data (millions of articles)	None	High

Wikimedia permits automated read access to Wikipedia for research and reuse. Its main requests: use the API rather than raw HTML scraping wherever possible, keep request rates modest, and send a descriptive User-Agent with contact details.
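
As a small illustration of the User-Agent request above, the sketch below assembles a descriptive header value in the same shape used throughout this guide. The helper name `build_user_agent`, the tool name, and the contact address are placeholders chosen for this example, not part of any Wikimedia library.

```python
def build_user_agent(tool: str, version: str, contact: str, note: str = "") -> str:
    """Return a descriptive User-Agent such as 'ResearchBot/1.0 (me@example.org; note)'."""
    if not tool or not contact:
        raise ValueError("tool name and contact details are both required")
    details = contact if not note else f"{contact}; {note}"
    return f"{tool}/{version} ({details})"


# Hypothetical values: replace with your own project name and contact address.
HEADERS = {
    "User-Agent": build_user_agent(
        "ResearchBot", "1.0", "contact@yourdomain.com", "educational use"
    )
}
```

Building the string in one place makes it easy to reuse the same identity across every client in a project.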

Method 1: The `wikipedia-api` Library (Easiest)

Extracting well-organized data from Wikipedia can simplify and speed up research, NLP training, or content analysis. With just a few libraries (wikipedia-api here, plus BeautifulSoup and pandas in later sections) you can turn encyclopedia content into usable data.

pip install wikipedia-api pandas

import wikipediaapi

# Create a Wikipedia API client
# Always include a descriptive User-Agent with contact info
wiki = wikipediaapi.Wikipedia(
    language="en",
    user_agent="ResearchBot/1.0 (contact@yourdomain.com; educational use)"
)

def get_article(title: str) -> dict | None:
    """
    Fetch a Wikipedia article by title.
    Returns structured data including summary, full text, links, and sections.
    """
    page = wiki.page(title)

    if not page.exists():
        print(f"Article not found: '{title}'")
        return None

    return {
        "title":      page.title,
        "url":        page.fullurl,
        "summary":    page.summary,            # First 2–3 paragraphs
        "full_text":  page.text,               # Complete article text
        "word_count": len(page.text.split()),
        "sections":   [s.title for s in page.sections],
        "links":      list(page.links.keys())[:50],   # First 50 linked articles
        "categories": list(page.categories.keys())[:20],
        "language":   page.language,
    }


# Single article
article = get_article("Machine Learning")
if article:
    print(f"Title:      {article['title']}")
    print(f"Words:      {article['word_count']:,}")
    print(f"Sections:   {', '.join(article['sections'][:5])}")
    print(f"Summary:    {article['summary'][:300]}...")
    print(f"Links to:   {', '.join(article['links'][:8])}")

Batch fetching multiple articles

import time
import pandas as pd

def batch_fetch_articles(titles: list[str], delay: float = 0.5) -> pd.DataFrame:
    """
    Fetch multiple Wikipedia articles and return as a DataFrame.
    Includes polite delay between requests.
    """
    records = []

    for i, title in enumerate(titles):
        print(f"[{i+1}/{len(titles)}] Fetching: {title}")
        article = get_article(title)

        if article:
            records.append({
                "title":        article["title"],
                "url":          article["url"],
                "word_count":   article["word_count"],
                "sections":     len(article["sections"]),
                "links":        len(article["links"]),
                "summary":      article["summary"][:500],
            })

        time.sleep(delay)   # Wikipedia asks for polite delays

    return pd.DataFrame(records)


# Fetch articles on Python scraping ecosystem
topics = [
    "Web scraping", "BeautifulSoup", "Scrapy (software)",
    "Playwright (software)", "Selenium (software)",
    "XPath", "CSS", "HTML", "JSON", "Python (programming language)"
]

df = batch_fetch_articles(topics)
df.to_csv("wikipedia_articles.csv", index=False)
print(df[["title", "word_count", "sections", "links"]])

Method 2: The MediaWiki REST API (Most Powerful)

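The MediaWiki Action API (`/w/api.php`) is Wikipedia's official programmatic interface, complemented by REST endpoints under `/api/rest_v1/` for page summaries. Together they provide far more control than the wikipedia-api library: you can query search results, get page metadata, fetch revision history, and look up the Wikidata ID attached to a page.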

import httpx
import asyncio
import pandas as pd

WIKI_API = "https://en.wikipedia.org/w/api.php"
HEADERS  = {
    "User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"
}

async def wiki_api_request(params: dict) -> dict:
    """Make a request to the MediaWiki API."""
    params = {**params, "format": "json"}   # copy so the caller's dict is not mutated
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(WIKI_API, params=params, timeout=15)
        r.raise_for_status()
    return r.json()


async def search_wikipedia(query: str, limit: int = 10) -> list[dict]:
    """Full-text search across Wikipedia."""
    data = await wiki_api_request({
        "action":   "query",
        "list":     "search",
        "srsearch": query,
        "srlimit":  limit,
        "srprop":   "snippet|titlesnippet|size|wordcount",
    })

    import re   # imported once per call rather than on every loop iteration

    results = []
    for item in data.get("query", {}).get("search", []):
        # Strip HTML tags from snippet
        snippet = re.sub(r"<[^>]+>", "", item.get("snippet", ""))
        results.append({
            "title":      item["title"],
            "snippet":    snippet,
            "word_count": item.get("wordcount", 0),
            "size_bytes": item.get("size", 0),
            "url":        f"https://en.wikipedia.org/wiki/{item['title'].replace(' ', '_')}",
        })

    return results


async def get_page_summary(title: str) -> dict:
    """Get a concise summary using Wikipedia's REST summary endpoint."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title.replace(' ', '_')}"
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(url, timeout=10)
        if r.status_code == 404:
            return {}
        r.raise_for_status()   # surface rate-limit and server errors instead of parsing them
        data = r.json()

    return {
        "title":       data.get("title"),
        "description": data.get("description"),
        "extract":     data.get("extract"),        # Plain text summary
        "thumbnail":   data.get("thumbnail", {}).get("source"),
        "url":         data.get("content_urls", {}).get("desktop", {}).get("page"),
        "wikidata_id": data.get("wikibase_item"),  # For Wikidata queries
    }


async def get_page_links(title: str, limit: int = 100) -> list[str]:
    """Get up to `limit` internal links from a Wikipedia page (first batch only, no continuation)."""
    data = await wiki_api_request({
        "action":    "query",
        "titles":    title,
        "prop":      "links",
        "pllimit":   limit,
        "plnamespace": 0,   # Only article namespace (not talk pages etc.)
    })

    pages  = data.get("query", {}).get("pages", {})
    links  = []
    for page_data in pages.values():
        for link in page_data.get("links", []):
            links.append(link["title"])

    return links


async def get_page_images(title: str) -> list[dict]:
    """Get all images used in a Wikipedia article."""
    data = await wiki_api_request({
        "action":  "query",
        "titles":  title,
        "prop":    "images",
        "imlimit": 50,
    })

    pages  = data.get("query", {}).get("pages", {})
    images = []
    for page_data in pages.values():
        for img in page_data.get("images", []):
            name = img["title"].replace("File:", "")
            if any(name.lower().endswith(ext) for ext in [".jpg", ".png", ".svg", ".gif"]):
                images.append({
                    "filename": name,
                    "wiki_url": f"https://commons.wikimedia.org/wiki/File:{name.replace(' ', '_')}",
                })

    return images


async def main_api():
    # Search
    print("── Search results for 'python web scraping' ──")
    results = await search_wikipedia("python web scraping", limit=5)
    for r in results:
        print(f"  {r['title']}: {r['snippet'][:80]}...")

    # Summary
    print("\n── Page summary ──")
    summary = await get_page_summary("Web scraping")
    print(f"  {summary['title']}: {summary['extract'][:200]}...")

    # Links
    print("\n── Top links from 'Web scraping' article ──")
    links = await get_page_links("Web scraping", limit=20)
    print(f"  {', '.join(links[:10])}")

asyncio.run(main_api())
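
The link helper above returns only the first batch of results. When more are available, the Action API includes a `continue` object in the response, and its keys are sent back with the next request. A minimal sketch of that loop follows; the function name `collect_links` and the injected `fetch` callable are assumptions made for this example so the logic can be exercised without network access.

```python
from typing import Callable


def collect_links(fetch: Callable[[dict], dict], title: str, max_links: int = 500) -> list[str]:
    """
    Gather link titles across several API responses by following the
    'continue' object that the Action API returns when more results exist.

    `fetch` is any callable that takes the query parameters and returns the
    decoded JSON response (for example a thin wrapper around an HTTP GET).
    """
    base = {
        "action": "query", "format": "json", "titles": title,
        "prop": "links", "pllimit": "max", "plnamespace": 0,
    }
    links: list[str] = []
    cont: dict = {}

    while len(links) < max_links:
        # Merge the continuation keys from the previous response into the request
        data = fetch({**base, **cont})
        for page in data.get("query", {}).get("pages", {}).values():
            links.extend(link["title"] for link in page.get("links", []))
        cont = data.get("continue", {})
        if not cont:   # no 'continue' object means the last batch was reached
            break

    return links[:max_links]
```

With httpx, `fetch` could be a small wrapper that calls `httpx.get(WIKI_API, params=params, headers=HEADERS, timeout=15).json()`. Keeping the HTTP call outside the loop logic also makes it simple to add a delay between batches.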

Method 3: Extracting Wikipedia Infoboxes

Infoboxes are the structured data panels on the right side of Wikipedia articles — they contain the most machine-readable data on Wikipedia: population, area, GDP, founding date, CEO, headquarters, etc.

You can use BeautifulSoup to find and parse infobox content, transforming unstructured web content into structured datasets useful for NLP training, trend analysis, or data journalism.

import re
import time   # used for the polite delay in the country loop below

import httpx
import pandas as pd
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"}

def fetch_wikipedia_html(title: str) -> str:
    """Fetch the raw HTML of a Wikipedia page."""
    url = f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}"
    r   = httpx.get(url, headers=HEADERS, follow_redirects=True, timeout=15)
    r.raise_for_status()
    return r.text


def parse_infobox(html: str) -> dict:
    """
    Extract key-value pairs from a Wikipedia infobox.
    Works with most infobox types: country, person, company, city, film, etc.
    """
    soup    = BeautifulSoup(html, "lxml")
    infobox = soup.select_one("table.infobox")

    if not infobox:
        return {}

    data = {}
    for row in infobox.select("tr"):
        # Standard two-column layout: th = label, td = value
        label_el = row.select_one("th")
        value_el  = row.select_one("td")

        if label_el and value_el:
            label = label_el.get_text(separator=" ", strip=True)
            # Strip footnote numbers like [1], [2]
            value = re.sub(r"\[\d+\]", "", value_el.get_text(separator=" ", strip=True))
            value = re.sub(r"\s+", " ", value).strip()

            if label and value and len(label) < 80:
                data[label] = value

    return data


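def parse_all_tables(html: str) -> list[pd.DataFrame]:
    """
    Extract every HTML table on a Wikipedia page as a DataFrame
    (infoboxes and navigation boxes included, not only wikitables).
    Great for statistical tables, comparison tables, rankings.
    """
    from io import StringIO

    try:
        # Recent pandas versions expect a file-like object rather than a raw HTML string
        return pd.read_html(StringIO(html), flavor="lxml")
    except ValueError:
        # read_html raises ValueError when the page contains no tables
        return []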


# Example: Extract country data from Wikipedia infoboxes
countries = ["India", "China", "United_States", "Brazil", "Germany"]

country_data = []
for country in countries:
    print(f"Fetching: {country}...")
    html     = fetch_wikipedia_html(country)
    infobox  = parse_infobox(html)
    tables   = parse_all_tables(html)

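    # Standardise common infobox fields. Labels vary between infobox templates,
    # and header-only rows have no value cell, so expect None for some fields.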
    country_data.append({
        "country":      country,
        "capital":      infobox.get("Capital") or infobox.get("Capital city"),
        "population":   infobox.get("Population"),
        "area":         infobox.get("Area"),
        "gdp_nominal":  infobox.get("GDP (nominal)") or infobox.get("GDP (PPP)"),
        "currency":     infobox.get("Currency"),
        "language":     infobox.get("Official languages"),
        "government":   infobox.get("Government"),
        "tables_found": len(tables),
    })
    time.sleep(0.5)

df = pd.DataFrame(country_data)
print(df.to_string(index=False))
df.to_csv("country_infoboxes.csv", index=False)
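
Infobox values arrive as free text, so a field such as area or population usually needs a cleaning pass before it can be used numerically. The sketch below pulls the first number out of a string. The helper name `first_number` is an assumption for this example, and it deliberately ignores word multipliers such as "million" or "trillion", which would need separate handling.

```python
import re
from typing import Optional

# A number: optional sign, digits with thousands separators, optional decimals
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def first_number(value: Optional[str]) -> Optional[float]:
    """Return the first number found in an infobox string, or None if there is none."""
    if not value:
        return None
    match = _NUMBER.search(value)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```

Applied to a DataFrame column, this could be used as `df["area"].map(first_number)`. Because only the first number is taken, a value that lists several units keeps whichever unit the infobox shows first.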

Extracting specific tables (rankings, statistics)

def get_wikipedia_table(title: str, table_index: int = 0, match_text: str = None) -> pd.DataFrame | None:
    """
    Extract a specific table from a Wikipedia page.

    Args:
        title: Wikipedia article title
        table_index: Which table to return (0 = first)
        match_text: Return the first table whose header contains this text
    """
    html    = fetch_wikipedia_html(title)
    tables  = parse_all_tables(html)

    if not tables:
        print(f"No tables found on '{title}'")
        return None

    if match_text:
        for t in tables:
            # Check if any column header contains match_text
            if any(match_text.lower() in str(col).lower() for col in t.columns):
                return t
        print(f"No table with '{match_text}' found")
        return None

    if table_index >= len(tables):
        print(f"Table index {table_index} out of range (found {len(tables)} tables)")
        return None

    return tables[table_index]


# Get the list of largest cities by population
cities_df = get_wikipedia_table(
    "List of largest cities",
    match_text="Population"
)
if cities_df is not None:
    print(f"Found table with {len(cities_df)} rows")
    print(cities_df.head(10))
    cities_df.to_csv("largest_cities.csv", index=False)

# Get Nobel Prize winners table
nobel_df = get_wikipedia_table("List of Nobel laureates", table_index=0)
if nobel_df is not None:
    print(f"\nNobel laureates table: {len(nobel_df)} entries")
    print(nobel_df.head(5))

Method 4: Building a Wikipedia Knowledge Graph Scraper

Wikipedia's internal links create a natural knowledge graph. Following links allows you to scrape entire topic clusters:

import asyncio
import httpx
import json
import pandas as pd   # used by main_graph() to save the node table
from collections import deque

HEADERS = {"User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"}

async def build_knowledge_graph(
    seed_topic: str,
    max_articles: int = 50,
    max_depth: int = 2,
    language: str = "en"
) -> dict:
    """
    Build a knowledge graph by crawling Wikipedia starting from a seed topic.
    Follows internal links up to max_depth levels.

    Returns: {
        "nodes": [{title, summary, url, depth}],
        "edges": [(source_title, target_title)]
    }
    """
    nodes  = {}
    edges  = []
    queue  = deque([(seed_topic, 0)])
    visited = set()

    async with httpx.AsyncClient(headers=HEADERS) as client:
        while queue and len(nodes) < max_articles:
            title, depth = queue.popleft()

            if title in visited or depth > max_depth:
                continue
            visited.add(title)

            print(f"  [depth={depth}] {title} ({len(nodes)}/{max_articles})")

            # Fetch summary
            try:
                r = await client.get(
                    f"https://{language}.wikipedia.org/api/rest_v1/page/summary/"
                    f"{title.replace(' ', '_')}",
                    timeout=10
                )
                if r.status_code != 200:
                    continue
                data = r.json()
            except Exception:
                continue

            nodes[title] = {
                "title":       data.get("title", title),
                "description": data.get("description", ""),
                "summary":     (data.get("extract") or "")[:300],
                "url":         data.get("content_urls", {})
                                   .get("desktop", {}).get("page", ""),
                "depth":       depth,
            }

            # Fetch links if not at max depth
            if depth < max_depth:
                links_r = await client.get(
                    f"https://{language}.wikipedia.org/w/api.php",
                    params={
                        "action": "query", "titles": title,
                        "prop": "links", "pllimit": 20,
                        "plnamespace": 0, "format": "json"
                    },
                    timeout=10
                )
                # Skip link expansion for this page if the request failed
                links_data = links_r.json() if links_r.status_code == 200 else {}
                pages      = links_data.get("query", {}).get("pages", {})

                for page_data in pages.values():
                    for link in page_data.get("links", []):
                        link_title = link["title"]
                        edges.append((title, link_title))
                        if link_title not in visited:
                            queue.append((link_title, depth + 1))

            await asyncio.sleep(0.3)   # Polite delay

    return {
        "seed":   seed_topic,
        "nodes":  list(nodes.values()),
        "edges":  edges[:500],   # Limit edges for manageability
        "stats":  {
            "total_articles": len(nodes),
            "total_edges":    len(edges),
            "max_depth":      max_depth,
        }
    }


async def main_graph():
    print("Building knowledge graph for 'Machine Learning'...")
    graph = await build_knowledge_graph(
        seed_topic="Machine learning",
        max_articles=30,
        max_depth=2
    )

    print(f"\nGraph built:")
    print(f"  Articles: {graph['stats']['total_articles']}")
    print(f"  Connections: {graph['stats']['total_edges']}")

    # Save as JSON for use in graph tools (NetworkX, Gephi, D3.js)
    with open("knowledge_graph.json", "w") as f:
        json.dump(graph, f, indent=2, ensure_ascii=False)

    # Save nodes as CSV for analysis
    nodes_df = pd.DataFrame(graph["nodes"])
    nodes_df.to_csv("knowledge_graph_nodes.csv", index=False)
    print(f"\nSaved to knowledge_graph.json and knowledge_graph_nodes.csv")

asyncio.run(main_graph())
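
Once the graph is saved, a quick first analysis is to count how often each article is linked to by the crawled pages. The sketch below does this with the standard library only. The function name `most_linked` is an assumption for this example, and it expects the same `nodes` and `edges` structure produced above. Node titles come from the summary endpoint, so they may not always match link titles exactly (for instance after a redirect).

```python
from collections import Counter


def most_linked(graph: dict, top_n: int = 5, nodes_only: bool = True) -> list[tuple[str, int]]:
    """
    Rank link targets by how many crawled articles point at them (in-degree).
    With nodes_only=True, targets that were never fetched as nodes are ignored.
    """
    known = {node["title"] for node in graph["nodes"]}
    counts = Counter(
        target for _source, target in graph["edges"]
        if not nodes_only or target in known
    )
    return counts.most_common(top_n)
```

For example, after `graph = json.load(open("knowledge_graph.json"))`, calling `most_linked(graph, top_n=10)` lists the most referenced articles within the crawl.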

Part 2: Wikidata — Structured Data from Wikipedia

Wikidata is the structured data backbone behind Wikipedia: a large knowledge base of facts in machine-readable form. Nearly every Wikipedia article links to a Wikidata entity with typed properties.

import httpx
import asyncio
import pandas as pd

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
HEADERS = {
    "User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)",
    "Accept": "application/sparql-results+json"
}

async def wikidata_query(sparql: str) -> list[dict]:
    """
    Execute a SPARQL query against Wikidata.
    Returns a list of result dicts.
    """
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(
            WIKIDATA_SPARQL,
            params={"query": sparql, "format": "json"},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()

    bindings = data.get("results", {}).get("bindings", [])
    results  = []
    for row in bindings:
        results.append({
            key: val.get("value") for key, val in row.items()
        })
    return results


async def get_country_data() -> pd.DataFrame:
    """Get population and GDP data for all countries from Wikidata."""
    sparql = """
    SELECT ?country ?countryLabel ?population ?gdp ?capital ?capitalLabel WHERE {
        ?country wdt:P31 wd:Q3624078.      # Instance of: sovereign state
        OPTIONAL { ?country wdt:P1082 ?population. }
        OPTIONAL { ?country wdt:P2131 ?gdp. }
        OPTIONAL { ?country wdt:P36 ?capital. }
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en".
        }
    }
    ORDER BY DESC(?population)
    LIMIT 50
    """
    results = await wikidata_query(sparql)
    df = pd.DataFrame(results)
    return df


async def get_tech_companies() -> pd.DataFrame:
    """Get major tech companies with founding year and HQ from Wikidata."""
    sparql = """
    SELECT ?company ?companyLabel ?founded ?hq ?hqLabel ?employees WHERE {
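        ?company wdt:P31 wd:Q4830453.      # Business enterprise
        # Industry filter. Verify this QID on wikidata.org before relying on it:
        # Q11032 appears to be the item for "newspaper", not software.
        ?company wdt:P452 wd:Q11032.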
        OPTIONAL { ?company wdt:P571 ?founded. }
        OPTIONAL { ?company wdt:P159 ?hq. }
        OPTIONAL { ?company wdt:P1128 ?employees. }
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en".
        }
        FILTER(BOUND(?employees) && ?employees > 1000)
    }
    ORDER BY DESC(?employees)
    LIMIT 30
    """
    results = await wikidata_query(sparql)
    return pd.DataFrame(results)


async def main_wikidata():
    print("── Top 20 countries by population (Wikidata) ──")
    countries = await get_country_data()
    print(countries[["countryLabel", "population", "gdp"]].head(20))
    countries.to_csv("wikidata_countries.csv", index=False)

    print("\n── Major tech companies (Wikidata) ──")
    companies = await get_tech_companies()
    print(companies[["companyLabel", "founded", "hqLabel", "employees"]].head(15))
    companies.to_csv("wikidata_tech_companies.csv", index=False)

asyncio.run(main_wikidata())
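
The query helper above keeps every value as a string, including numbers and entity URIs. SPARQL JSON results carry a `type` and, for typed literals, a `datatype`, which can be used to convert values. The following is a minimal sketch of that conversion; the names `coerce_binding` and `coerce_row` are assumptions for this example, and only a few XSD numeric types are handled.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def coerce_binding(cell: dict):
    """Convert one SPARQL JSON binding into an int, float, QID string, or plain string."""
    value = cell.get("value", "")
    if cell.get("type") == "uri" and value.startswith(ENTITY_PREFIX):
        return value[len(ENTITY_PREFIX):]      # keep just the QID part of the URI
    datatype = cell.get("datatype", "")
    try:
        if datatype == XSD + "integer":
            return int(value)
        if datatype in (XSD + "decimal", XSD + "double", XSD + "float"):
            return float(value)
    except ValueError:
        pass                                    # fall through to the raw string
    return value


def coerce_row(row: dict) -> dict:
    """Apply coerce_binding to every variable in one result row."""
    return {key: coerce_binding(cell) for key, cell in row.items()}
```

Inside `wikidata_query`, the list of rows could be built with `[coerce_row(row) for row in bindings]` so that numeric columns sort and aggregate correctly in pandas.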

Part 3: India's Open Government Data — data.gov.in

The datagovindia library is a wrapper around 100,000+ APIs of the Government of India's open data platform data.gov.in. Its functionality centres around three aspects: API discovery — finding the right API from all available APIs; API information — getting details about a particular API; and querying the API — getting a tidy dataset from the chosen API.

This is an extraordinary resource for Indian developers and researchers — census data, agricultural statistics, health data, economic indicators, transport data — all free, all official, all available via a consistent API.

pip install datagovindia pandas
# Get your free API key at: data.gov.in/user/register

from datagovindia import DataGovIndia
import pandas as pd

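# Initialise with your API key
# Note: the method names used below (search_data, get_data_info, get_data) are
# as given in this guide; check them against the datagovindia release you
# install, since its interface may differ between versions.
dgi = DataGovIndia(api_key="YOUR_DATA_GOV_IN_API_KEY")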

# ── Discovery: Find relevant datasets ──────────────────────────
def search_datasets(keyword: str, limit: int = 10) -> pd.DataFrame:
    """Search for government datasets by keyword."""
    results = dgi.search_data(keyword, results=limit)
    return pd.DataFrame(results)

# Search for datasets
education_datasets = search_datasets("school enrollment india")
print("── Education datasets on data.gov.in ──")
print(education_datasets[["title", "org_type", "source"]].head(10))

agriculture_datasets = search_datasets("crop production india state")
print("\n── Agriculture datasets ──")
print(agriculture_datasets[["title", "source"]].head(5))

# ── Get dataset info ──────────────────────────────────────────
def get_dataset_info(index_name: str) -> dict:
    """Get metadata about a specific dataset."""
    return dgi.get_data_info(index_name)

# ── Download a dataset ────────────────────────────────────────
def download_dataset(index_name: str, limit: int = 1000) -> pd.DataFrame:
    """Download records from a government dataset."""
    data = dgi.get_data(index_name, results=limit)
    return pd.DataFrame(data)


# Example datasets (index_names from data.gov.in search):

# State-wise school enrollment data
SCHOOL_DATASET = "your-dataset-index-name-from-search"

try:
    df = download_dataset(SCHOOL_DATASET)
    print(f"\nDownloaded {len(df)} records")
    print(df.head())
    df.to_csv("india_school_enrollment.csv", index=False)
except Exception as e:
    print(f"Dataset access error: {e}")

Direct API access (no wrapper library)

The Open Government Data Platform India offers a collection of APIs that provide access to open datasets. Users can integrate these APIs into their applications to retrieve and utilize public data, enhancing data-driven solutions and innovations.

import httpx
import asyncio
import pandas as pd

DATA_GOV_BASE = "https://api.data.gov.in/resource"
API_KEY       = "YOUR_DATA_GOV_IN_API_KEY"

async def fetch_gov_dataset(
    resource_id: str,
    limit: int = 100,
    offset: int = 0,
    filters: dict = None
) -> dict:
    """
    Fetch records from a data.gov.in dataset.

    Args:
        resource_id: Dataset UUID from data.gov.in (visible in URL)
        limit: Records per request (max 100)
        offset: Pagination offset
        filters: Dict of field:value filters, e.g. {"State": "Maharashtra"}
    """
    params = {
        "api-key": API_KEY,
        "format":  "json",
        "limit":   limit,
        "offset":  offset,
    }

    if filters:
        for field, value in filters.items():
            params[f"filters[{field}]"] = value

    url = f"{DATA_GOV_BASE}/{resource_id}"
    async with httpx.AsyncClient() as client:
        r = await client.get(url, params=params, timeout=20)
        r.raise_for_status()
    return r.json()


async def download_full_dataset(resource_id: str, max_records: int = 5000) -> pd.DataFrame:
    """
    Download an entire dataset by paginating through all records.
    """
    all_records = []
    offset      = 0
    page_size   = 100

    while len(all_records) < max_records:
        data = await fetch_gov_dataset(resource_id, limit=page_size, offset=offset)

        records = data.get("records", [])
        if not records:
            break

        all_records.extend(records)
        total = int(data.get("total", 0))

        print(f"  Downloaded {len(all_records)}/{min(total, max_records)} records...")

        if len(all_records) >= total or len(all_records) >= max_records:
            break

        offset += page_size
        await asyncio.sleep(0.5)   # Polite delay

    df = pd.DataFrame(all_records)
    print(f"  Total: {len(df)} records in {len(df.columns)} columns")
    return df


# Resource IDs can be found on data.gov.in dataset pages
# Format: https://api.data.gov.in/resource/RESOURCE-UUID?api-key=KEY&format=json

async def main_datagov():
    # Example resource ID: replace it with the ID of the dataset you want,
    # and confirm on its data.gov.in page what the resource actually contains.
    CENSUS_RESOURCE_ID = "9ef84268-d588-465a-a308-a864a43d0070"

    print("Downloading dataset from data.gov.in...")
    df = await download_full_dataset(CENSUS_RESOURCE_ID, max_records=500)

    if not df.empty:
        print(f"\nDataset shape: {df.shape}")
        print(f"Columns: {list(df.columns)}")
        print(df.head())
        df.to_csv("india_census_data.csv", index=False)

asyncio.run(main_datagov())

Part 4: US Federal Data — data.gov

The US government's data.gov platform hosts hundreds of thousands of datasets from every federal agency — health, transportation, energy, agriculture, economy, education. All free, all open.

import httpx
import asyncio
import pandas as pd

DATAGOV_API = "https://catalog.data.gov/api/3/action"

async def search_us_datasets(
    query: str,
    limit: int = 10,
    organization: str = None
) -> pd.DataFrame:
    """
    Search the US data.gov catalog.

    Args:
        query: Search terms
        limit: Number of results
        organization: Filter by agency, e.g. "census-bureau", "cdc-gov"
    """
    params = {
        "q":     query,
        "rows":  limit,
        "sort":  "score desc",
    }
    if organization:
        params["fq"] = f"organization:{organization}"

    async with httpx.AsyncClient() as client:
        r = await client.get(
            f"{DATAGOV_API}/package_search",
            params=params,
            timeout=15
        )
        r.raise_for_status()
        data = r.json()

    datasets = []
    for item in data.get("result", {}).get("results", []):
        datasets.append({
            "name":         item.get("name"),
            "title":        item.get("title"),
            "organization": (item.get("organization") or {}).get("title"),
            "description":  (item.get("notes") or "")[:200],
            "formats":      [res.get("format") for res in item.get("resources", [])],
            "url":          f"https://catalog.data.gov/dataset/{item.get('name')}",
            "downloads":    item.get("tracking_summary", {}).get("total", 0),
        })

    return pd.DataFrame(datasets)


async def download_datagov_csv(resource_url: str) -> pd.DataFrame:
    """Download a CSV dataset directly from data.gov."""
    async with httpx.AsyncClient(follow_redirects=True) as client:
        r = await client.get(resource_url, timeout=60)
        r.raise_for_status()

    import io
    df = pd.read_csv(io.StringIO(r.text))
    print(f"Downloaded: {df.shape[0]} rows × {df.shape[1]} columns")
    return df


async def get_dataset_resources(dataset_name: str) -> list[dict]:
    """Get all downloadable resources for a dataset."""
    async with httpx.AsyncClient() as client:
        r = await client.get(
            f"{DATAGOV_API}/package_show",
            params={"id": dataset_name},
            timeout=15
        )
        r.raise_for_status()
        data = r.json()

    resources = []
    for res in data.get("result", {}).get("resources", []):
        resources.append({
            "name":   res.get("name"),
            "format": res.get("format"),
            "url":    res.get("url"),
            "size":   res.get("size"),
        })
    return resources


async def main_datagov_us():
    # Search for health datasets
    print("── Searching data.gov for COVID datasets ──")
    health_df = await search_us_datasets("COVID vaccination rates state", limit=5)
    print(health_df[["title", "organization", "formats"]].to_string(index=False))

    # Search for economic datasets
    print("\n── Economic datasets from Census Bureau ──")
    econ_df = await search_us_datasets(
        "employment statistics",
        limit=5,
        organization="census-bureau"
    )
    print(econ_df[["title", "description"]].to_string(index=False))

asyncio.run(main_datagov_us())
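
The resource list returned by `get_dataset_resources` often mixes formats, so a small selection step is useful before calling `download_datagov_csv`. A minimal sketch is given below; the helper name `pick_resource` and the default preference order are assumptions for this example.

```python
from typing import Optional


def pick_resource(resources: list[dict], preferred=("CSV", "JSON")) -> Optional[dict]:
    """
    Return the first resource whose format matches the preference order.
    Format labels are compared case-insensitively and with whitespace trimmed,
    because catalog entries are not guaranteed to be consistent.
    """
    for wanted in preferred:
        for res in resources:
            fmt = (res.get("format") or "").strip().upper()
            if fmt == wanted.upper() and res.get("url"):
                return res
    return None
```

A typical flow would be `res = pick_resource(await get_dataset_resources(name), preferred=("CSV",))` followed by `await download_datagov_csv(res["url"])` when `res` is not `None`.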

Part 5: Combining Wikipedia + Government Data

The real power comes from joining multiple open sources. Here's an example combining Wikipedia infobox data with official government statistics:

import asyncio
import pandas as pd

# Reuses fetch_wikipedia_html() and parse_infobox() from Method 3 above.

async def build_india_states_dataset() -> pd.DataFrame:
    """
    Build a comprehensive dataset of Indian states by combining:
    1. Wikipedia infoboxes (area, founded, capital)
    2. data.gov.in (official population, literacy, GDP)
    """
    INDIAN_STATES = [
        "Maharashtra", "Uttar Pradesh", "Karnataka", "Tamil Nadu",
        "Gujarat", "Rajasthan", "West Bengal", "Andhra Pradesh",
        "Telangana", "Kerala", "Madhya Pradesh", "Bihar"
    ]

    # ── Wikipedia data ────────────────────────────────────────
    print("Fetching Wikipedia infobox data...")
    wiki_records = []
    for state in INDIAN_STATES:
        html    = fetch_wikipedia_html(state)   # article titles are the plain state names
        infobox = parse_infobox(html)
        wiki_records.append({
            "state":    state,
            "capital":  infobox.get("Capital") or infobox.get("Capital city"),
            "area_km2": infobox.get("Area"),
            "founded":  infobox.get("Formation") or infobox.get("Established"),
            "districts":infobox.get("Districts"),
            "wiki_url": f"https://en.wikipedia.org/wiki/{state.replace(' ', '_')}",
        })
        await asyncio.sleep(0.5)

    wiki_df = pd.DataFrame(wiki_records)

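    # ── Merge ─────────────────────────────────────────────────
    # Placeholder: no government frame is loaded in this example, so the
    # Wikipedia frame is saved as-is. A data.gov.in frame keyed on state
    # name would be joined here, e.g. wiki_df.merge(gov_df, on="state", how="left").
    final_df = wiki_df
    final_df.to_csv("india_states_combined.csv", index=False)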

    print(f"\nBuilt dataset: {len(final_df)} states × {len(final_df.columns)} columns")
    print(final_df[["state", "capital", "area_km2"]].to_string(index=False))
    return final_df

asyncio.run(build_india_states_dataset())
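
The merge step in the function above simply passes the Wikipedia frame through. Here is a minimal sketch of how the join could look once a government frame is available. Everything about the government side is an assumption: the column name state_name, the helper names normalise_state and merge_wiki_and_gov, and the population figures, which are placeholder numbers rather than real statistics.

```python
import pandas as pd


def normalise_state(name: str) -> str:
    """Lower-case, swap '&' for 'and', and collapse whitespace so join keys line up."""
    return " ".join(str(name).lower().replace("&", "and").split())


def merge_wiki_and_gov(
    wiki_df: pd.DataFrame,
    gov_df: pd.DataFrame,
    gov_state_col: str = "state_name",   # hypothetical column name
) -> pd.DataFrame:
    """Left-join government columns onto the Wikipedia frame by normalised state name."""
    wiki = wiki_df.assign(_key=wiki_df["state"].map(normalise_state))
    gov  = gov_df.assign(_key=gov_df[gov_state_col].map(normalise_state))

    merged = wiki.merge(
        gov.drop(columns=[gov_state_col]),
        on="_key",
        how="left",
        indicator=True,   # adds a _merge column showing which rows matched
    )

    unmatched = merged.loc[merged["_merge"] == "left_only", "state"].tolist()
    if unmatched:
        print(f"No government row for: {', '.join(unmatched)}")

    return merged.drop(columns=["_key", "_merge"])


# Placeholder frames: the population values are made up for illustration
wiki_demo = pd.DataFrame({
    "state":   ["Tamil Nadu", "Kerala"],
    "capital": ["Chennai", "Thiruvananthapuram"],
})
gov_demo = pd.DataFrame({
    "state_name": ["  TAMIL  NADU", "Goa"],
    "population": [111, 222],
})

combined = merge_wiki_and_gov(wiki_demo, gov_demo)
print(combined.to_string(index=False))
```

A left join keeps every Wikipedia row even when the government source spells a state differently or omits it, and the printed list of unmatched names shows where the normalisation needs more work.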

Wikipedia Database Dumps (For Bulk Access)

For extremely large-scale research (millions of articles), use Wikipedia's official database dumps instead of the API:

# Download the latest English Wikipedia article dump (~22GB compressed)
# Use dumps.wikimedia.org — updated monthly
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# For just abstracts and titles (~1GB) — much more practical
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz

import gzip
import xml.etree.ElementTree as ET
import pandas as pd

def parse_wikipedia_abstracts(dump_file: str, max_articles: int = 10000) -> pd.DataFrame:
    """
    Parse the Wikipedia abstracts dump into a DataFrame.
    Much faster than API calls for large datasets.
    """
    records = []

    with gzip.open(dump_file, "rb") as f:
        for event, elem in ET.iterparse(f, events=["end"]):
            if elem.tag != "doc":
                continue

            title    = elem.findtext("title", "").replace("Wikipedia: ", "")
            url      = elem.findtext("url", "")
            abstract = elem.findtext("abstract", "")

            if title and abstract and len(abstract) > 50:
                records.append({
                    "title":    title,
                    "url":      url,
                    "abstract": abstract[:500],
                    "length":   len(abstract),
                })

            elem.clear()   # Free memory held by the processed <doc> element

            # Stop reading once enough articles are collected; otherwise the
            # loop would keep decompressing the rest of the dump for nothing.
            if len(records) >= max_articles:
                break

    df = pd.DataFrame(records)
    print(f"Parsed {len(df):,} articles from dump")
    return df

# df = parse_wikipedia_abstracts("enwiki-latest-abstract.xml.gz", max_articles=100000)
# df.to_parquet("wikipedia_abstracts.parquet")  # Parquet for efficient storage
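
The parser above covers the abstracts file only. For the full pages-articles dump, a similar streaming approach can work directly on the .bz2 file without unpacking it first. The sketch below assumes the usual MediaWiki export layout (page elements containing title, ns, an optional redirect, and revision/text) and strips the XML namespace from tag names because the schema version in the namespace URI changes over time. The function names are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator, Optional


def local_tag(elem) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return elem.tag.rsplit("}", 1)[-1]


def iter_dump_pages(dump_path: str, max_pages: Optional[int] = None) -> Iterator[dict]:
    """Stream article-namespace pages from a pages-articles .xml.bz2 dump."""
    yielded = 0
    with bz2.open(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if local_tag(elem) != "page":
                continue

            fields = {local_tag(child): child for child in elem}
            title  = fields["title"].text if "title" in fields else ""
            ns     = fields["ns"].text if "ns" in fields else None
            is_redirect = "redirect" in fields

            text = ""
            revision = fields.get("revision")
            if revision is not None:
                for child in revision:
                    if local_tag(child) == "text":
                        text = child.text or ""

            elem.clear()   # Release the page's children before moving on

            if ns != "0":  # Keep only the main article namespace
                continue

            yield {"title": title, "text_length": len(text), "is_redirect": is_redirect}
            yielded += 1
            if max_pages is not None and yielded >= max_pages:
                break


# Usage (commented out because it needs the downloaded dump):
# for page in iter_dump_pages("enwiki-latest-pages-articles.xml.bz2", max_pages=5):
#     print(page)
```

Because it is a generator, the caller decides how much to hold in memory, for example by writing rows to disk in batches rather than building one large list.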

FAQ

Q: Is scraping Wikipedia legal? Extracting public Wikipedia data is permitted, provided you keep the load on Wikipedia's servers minimal, do not disrupt the site's operation, and respect robots.txt. Wikipedia's content is published under Creative Commons Attribution-ShareAlike 4.0, which allows free reuse, including in commercial applications, as long as you attribute the source and share alike. Prefer the API over raw HTML scraping, and send a meaningful User-Agent.

Q: How do I find the right data.gov.in resource ID? Go to data.gov.in, search for your topic, click a dataset, and look at the API URL shown on the dataset page. The UUID in the URL is your resource ID.
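
If you copy the whole API URL from a dataset page, the resource ID can be pulled out programmatically. The sketch below assumes the ID appears as a standard UUID in the URL path, as in the URL format shown earlier. The function name and the example UUID are placeholders, not a real dataset.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Standard 8-4-4-4-12 hexadecimal UUID layout
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def extract_resource_id(api_url: str) -> Optional[str]:
    """Return the first UUID found in the URL path, lower-cased, or None."""
    path  = urlparse(api_url).path   # ignore the query string (api-key, format, ...)
    match = UUID_RE.search(path)
    return match.group(0).lower() if match else None


# Placeholder URL in the documented shape
example_url = (
    "https://api.data.gov.in/resource/"
    "01234567-89AB-cdef-0123-456789abcdef?api-key=KEY&format=json"
)
print(extract_resource_id(example_url))
```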

Q: What is Wikidata and how is it different from Wikipedia? Wikipedia contains articles (human-readable text). Wikidata contains structured facts (machine-readable data) — dates, quantities, relationships, identifiers. Every Wikipedia article links to a Wikidata entity. The SPARQL query interface lets you query across all of Wikidata at once.

Q: Can I use Wikipedia data for commercial AI training? Yes — Wikipedia's CC BY-SA 4.0 licence allows commercial use. You must attribute Wikipedia and share your derivative work under the same licence. Many major LLMs include Wikipedia in their training data.

Q: What's the politest way to scrape Wikipedia at scale? Use the official API. Set a descriptive User-Agent. Add 0.5–1 second delays between requests. For bulk access (millions of articles), use database dumps — they put zero load on Wikipedia's servers.
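
The delay advice in the last answer can be sketched as a small throttle that sleeps only for the part of the interval that has not already elapsed, so slow responses are not penalised twice. The class name PoliteThrottle and its injectable clock and sleep parameters are illustrative choices, not part of any Wikipedia client library.

```python
import time


class PoliteThrottle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval: float = 0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        pause = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            pause = max(0.0, self.min_interval - elapsed)
        if pause > 0:
            self._sleep(pause)
        self._last = self._clock()
        return pause


# Usage: call wait() immediately before each API request
throttle = PoliteThrottle(min_interval=0.5)
for title in ["Web scraping", "XPath"]:
    throttle.wait()
    # response = client.get(...)   # the real request would go here
    print(f"Would fetch: {title}")
```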

Summary

Source	What you get	Method	Licence
Wikipedia API	Article text, summaries, links	`wikipedia-api` + MediaWiki API	CC BY-SA 4.0
Wikipedia HTML	Infoboxes, tables, images	BeautifulSoup	CC BY-SA 4.0
Wikidata SPARQL	Structured facts, relationships	SPARQL query	CC0 (public domain)
Wikipedia dumps	Millions of articles	XML parsing	CC BY-SA 4.0
data.gov.in	100,000+ India govt datasets	`datagovindia` + REST API	NDSAP (open, free)
data.gov (US)	300,000+ US federal datasets	REST API	Mostly public domain

Part 5: Combining Wikipedia + Government Data

The real power comes from joining multiple open sources. Here's an example combining Wikipedia infobox data with official government statistics:

import asyncio
import pandas as pd

async def build_india_states_dataset() -> pd.DataFrame:
    """
    Build a comprehensive dataset of Indian states by combining:
    1. Wikipedia infoboxes (area, founded, capital)
    2. data.gov.in (official population, literacy, GDP)
    """
    INDIAN_STATES = [
        "Maharashtra", "Uttar Pradesh", "Karnataka", "Tamil Nadu",
        "Gujarat", "Rajasthan", "West Bengal", "Andhra Pradesh",
        "Telangana", "Kerala", "Madhya Pradesh", "Bihar"
    ]

    # ── Wikipedia data ────────────────────────────────────────
    print("Fetching Wikipedia infobox data...")
    wiki_records = []
    for state in INDIAN_STATES:
        # Article titles match the state names, same as wiki_url below
        html    = fetch_wikipedia_html(state.replace(" ", "_"))
        infobox = parse_infobox(html)
        wiki_records.append({
            "state":    state,
            "capital":  infobox.get("Capital") or infobox.get("Capital city"),
            "area_km2": infobox.get("Area"),
            "founded":  infobox.get("Formation") or infobox.get("Established"),
            "districts":infobox.get("Districts"),
            "wiki_url": f"https://en.wikipedia.org/wiki/{state.replace(' ', '_')}",
        })
        await asyncio.sleep(0.5)

    wiki_df = pd.DataFrame(wiki_records)

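    # ── Merge ─────────────────────────────────────────────────
    # Only the Wikipedia side is built in this example. To add official
    # figures, left-join a data.gov.in DataFrame on the "state" column.
    final_df = wiki_df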
    final_df.to_csv("india_states_combined.csv", index=False)

    print(f"\nBuilt dataset: {len(final_df)} states × {len(final_df.columns)} columns")
    print(final_df[["state", "capital", "area_km2"]].to_string(index=False))
    return final_df

asyncio.run(build_india_states_dataset())
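
Two practical snags show up when you actually join these sources: infobox values arrive as raw strings (an area cell usually holds a number, a unit, and a conversion in brackets), and state names are rarely spelled identically across sources. The sketch below uses only the standard library and shows one way to handle both. All three function names are hypothetical, and the regular expression assumes the area string contains a number directly followed by "km".

```python
import re
from typing import Optional


def normalize_state_key(name: str) -> str:
    """Lowercase, trim, and collapse whitespace so names from different sources match."""
    return re.sub(r"\s+", " ", name).strip().lower()


def parse_area_km2(raw: Optional[str]) -> Optional[float]:
    """Extract the first number that precedes 'km' from an infobox string."""
    if not raw:
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*km", raw)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))


def left_join_on_state(wiki_rows: list, gov_rows: list) -> list:
    """Left-join government rows onto Wikipedia rows using the normalized state name.

    Wikipedia rows are always kept; government columns are added when a
    match exists. The Wikipedia spelling of "state" wins.
    """
    gov_by_key = {normalize_state_key(r["state"]): r for r in gov_rows}
    merged = []
    for row in wiki_rows:
        extra = gov_by_key.get(normalize_state_key(row["state"]), {})
        extra_cols = {k: v for k, v in extra.items() if k != "state"}
        merged.append({**extra_cols, **row})
    return merged
```

The same idea carries over to pandas: add a normalized key column to both DataFrames and call wiki_df.merge(gov_df, on="state_key", how="left"), where state_key is a column name you would create yourself.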

Wikipedia Database Dumps (For Bulk Access)

For extremely large-scale research (millions of articles), use Wikipedia's official database dumps instead of the API:

# Download the latest English Wikipedia article dump (~22GB compressed)
# Use dumps.wikimedia.org — updated monthly
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# For just abstracts and titles (~1GB) — much more practical
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz

import gzip
import xml.etree.ElementTree as ET
import pandas as pd

def parse_wikipedia_abstracts(dump_file: str, max_articles: int = 10000) -> pd.DataFrame:
    """
    Parse the Wikipedia abstracts dump into a DataFrame.
    Much faster than API calls for large datasets.
    """
    records = []

    with gzip.open(dump_file, "rb") as f:
        for event, elem in ET.iterparse(f, events=["end"]):
            if elem.tag != "doc":
                continue

            title    = elem.findtext("title", "").replace("Wikipedia: ", "")
            url      = elem.findtext("url", "")
            abstract = elem.findtext("abstract", "")

            if title and abstract and len(abstract) > 50:
                records.append({
                    "title":    title,
                    "url":      url,
                    "abstract": abstract[:500],
                    "length":   len(abstract),
                })

            elem.clear()   # Free memory held by the processed <doc>

            # Stop reading once the cap is reached instead of scanning the whole dump
            if len(records) >= max_articles:
                break

    df = pd.DataFrame(records)
    print(f"Parsed {len(df):,} articles from dump")
    return df

# df = parse_wikipedia_abstracts("enwiki-latest-abstract.xml.gz", max_articles=100000)
# df.to_parquet("wikipedia_abstracts.parquet")  # Parquet for efficient storage
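
The parser above handles the small abstracts file. For the full pages-articles dump, the same streaming idea applies, but the file is bz2-compressed and its tags carry an XML namespace. Here is a minimal sketch, assuming the usual MediaWiki export layout (a page element containing title, ns, an optional redirect, and revision/text). The function names iter_dump_pages and local_name are hypothetical, and the namespace is stripped rather than hard-coded because the export schema version can change between dumps.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_dump_pages(dump_file: str, limit: int = 1000) -> Iterator[dict]:
    """Stream main-namespace, non-redirect pages from a .xml.bz2 dump.

    Reads the compressed file incrementally, so the dump never has to be
    decompressed to disk or loaded into memory in full.
    """
    yielded = 0
    with bz2.open(dump_file, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if local_name(elem.tag) != "page":
                continue

            children = {local_name(child.tag): child for child in elem}
            title = children["title"].text if "title" in children else None
            ns = children["ns"].text if "ns" in children else None
            is_redirect = "redirect" in children

            # ns "0" is the article namespace; talk, user, etc. use other numbers
            if title and ns == "0" and not is_redirect:
                text_elem = next(
                    (e for e in elem.iter() if local_name(e.tag) == "text"), None
                )
                wikitext = text_elem.text if text_elem is not None else None
                yield {"title": title, "wikitext": wikitext or ""}
                yielded += 1

            elem.clear()  # Release the page's children before moving on
            if yielded >= limit:
                break
```

Because this is a generator, you can feed it straight into a loop or a batching step and stop whenever you have enough pages. The wikitext it yields is raw markup, so it still needs a wikitext parser before it reads like article prose.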

FAQ

Summary

Source	What you get	Method	Licence
Wikipedia API	Article text, summaries, links	`wikipedia-api` + MediaWiki API	CC BY-SA 4.0
Wikipedia HTML	Infoboxes, tables, images	BeautifulSoup	CC BY-SA 4.0
Wikidata SPARQL	Structured facts, relationships	SPARQL query	CC0 (public domain)
Wikipedia dumps	Millions of articles	XML parsing	CC BY-SA 4.0
data.gov.in	100,000+ India govt datasets	`datagovindia` + REST API	NDSAP (open, free)
data.gov (US)	300,000+ US federal datasets	REST API	Mostly public domain
