Scraping Wikipedia & Open Government Data with Python (2026 Complete Guide)

Learn how to scrape Wikipedia articles, infoboxes, and tables with Python in 2026 — plus access 100,000+ datasets from India's data.gov.in and the US data.gov APIs. Full working code.

Tags: python wikipedia scraping, scrape wikipedia python tutorial, wikipedia api python, data.gov.in python api tutorial, open government data python, wikipedia infobox scraping python
ZyVOP

Senior Developer

June 18, 2026

Introduction: The Cleanest Data on the Internet

Every post in this series so far has dealt with the hard problem: sites that don't want to be scraped. Anti-bot systems, aggressive rate limiting, JavaScript-rendered content, banned accounts, legal grey areas.

Wikipedia and open government data platforms are the complete opposite. They're designed for reuse. They want you to use their data. They provide official APIs, detailed documentation, and explicit licensing that allows free commercial and non-commercial use. Wikipedia's content is published under Creative Commons Attribution-ShareAlike, which requires attribution and sharing derivatives under the same license. The Indian government's data.gov.in platform explicitly invites developers to build on its 100,000+ APIs. The US data.gov provides open access to hundreds of thousands of federal datasets.

This is the most beginner-friendly post in the entire series, but the data available through these sources is extraordinarily powerful. English Wikipedia alone has more than 6.7 million articles, with infoboxes full of structured data, tables full of statistics, and cross-linked articles ideal for knowledge graph construction.

By the end of this guide you'll be able to:

  • Use the Wikipedia API and wikipedia-api library to extract articles, summaries, and links

  • Parse Wikipedia infoboxes and tables into clean pandas DataFrames

  • Build a multi-article knowledge scraper with link following

  • Access India's data.gov.in platform and its 100,000+ government datasets

  • Download and analyse US federal datasets from data.gov

  • Combine Wikipedia and government data into a unified research dataset


Part 1: Wikipedia — Four Ways to Access Data

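Wikipedia can be accessed in several ways, each suited to different use cases. The table below compares five options; database dumps, the bulk option, are covered near the end of the guide: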

| Method | Best for | Rate limit | Effort |
|---|---|---|---|
| wikipedia-api library | Article text, summaries, links | Generous | Very Low |
| MediaWiki REST API | Any structured query | Generous | Low |
| wptools library | Infoboxes, structured data | Generous | Low |
| Direct HTML scraping | Tables, custom extraction | Generous | Medium |
| Database dumps | Bulk data (millions of articles) | None | High |

Wikipedia's robots.txt welcomes friendly, low-speed bots on article pages. Wikimedia's main requests are to use the API rather than raw HTML scraping wherever possible and to send a descriptive User-Agent that includes contact information.
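
The two requests above (a descriptive User-Agent and unhurried pacing) can be sketched with the standard library alone. The helper names `build_user_agent` and `MinIntervalThrottle` are illustrative, not part of any Wikipedia library, and the 0.5 second interval is an assumption rather than an official limit.

```python
import time


def build_user_agent(bot_name: str, version: str, contact: str) -> str:
    """Format a User-Agent string of the form 'Name/version (contact)'."""
    return f"{bot_name}/{version} ({contact})"


class MinIntervalThrottle:
    """Wait until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval: float = 0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


headers = {"User-Agent": build_user_agent("ResearchBot", "1.0", "contact@yourdomain.com")}
throttle = MinIntervalThrottle(min_interval=0.5)
throttle.wait()   # first call returns immediately
throttle.wait()   # second call pauses until the interval has elapsed
```

Calling `throttle.wait()` before each request keeps the pacing logic in one place, so individual fetch functions do not each need their own sleep call.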


Method 1: The wikipedia-api Library (Easiest)

Extracting well-organized data from Wikipedia can simplify and speed up research, NLP training, and content analysis. With a few libraries (wikipedia-api, BeautifulSoup, and pandas) you can turn encyclopedia content into usable data.

pip install wikipedia-api pandas

import wikipediaapi

# Create a Wikipedia API client
# Always include a descriptive User-Agent with contact info
wiki = wikipediaapi.Wikipedia(
    language="en",
    user_agent="ResearchBot/1.0 (contact@yourdomain.com; educational use)"
)

def get_article(title: str) -> dict | None:
    """
    Fetch a Wikipedia article by title.
    Returns structured data including summary, full text, links, and sections.
    """
    page = wiki.page(title)

    if not page.exists():
        print(f"Article not found: '{title}'")
        return None

    return {
        "title":      page.title,
        "url":        page.fullurl,
        "summary":    page.summary,            # First 2–3 paragraphs
        "full_text":  page.text,               # Complete article text
        "word_count": len(page.text.split()),
        "sections":   [s.title for s in page.sections],
        "links":      list(page.links.keys())[:50],   # First 50 linked articles
        "categories": list(page.categories.keys())[:20],
        "language":   page.language,
    }


# Single article
article = get_article("Machine Learning")
if article:
    print(f"Title:      {article['title']}")
    print(f"Words:      {article['word_count']:,}")
    print(f"Sections:   {', '.join(article['sections'][:5])}")
    print(f"Summary:    {article['summary'][:300]}...")
    print(f"Links to:   {', '.join(article['links'][:8])}")

Batch fetching multiple articles

import time
import pandas as pd

def batch_fetch_articles(titles: list[str], delay: float = 0.5) -> pd.DataFrame:
    """
    Fetch multiple Wikipedia articles and return as a DataFrame.
    Includes polite delay between requests.
    """
    records = []

    for i, title in enumerate(titles):
        print(f"[{i+1}/{len(titles)}] Fetching: {title}")
        article = get_article(title)

        if article:
            records.append({
                "title":        article["title"],
                "url":          article["url"],
                "word_count":   article["word_count"],
                "sections":     len(article["sections"]),
                "links":        len(article["links"]),
                "summary":      article["summary"][:500],
            })

        time.sleep(delay)   # Wikipedia asks for polite delays

    return pd.DataFrame(records)


# Fetch articles on Python scraping ecosystem
topics = [
    "Web scraping", "BeautifulSoup", "Scrapy (software)",
    "Playwright (software)", "Selenium (software)",
    "XPath", "CSS", "HTML", "JSON", "Python (programming language)"
]

df = batch_fetch_articles(topics)
df.to_csv("wikipedia_articles.csv", index=False)
print(df[["title", "word_count", "sections", "links"]])
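
The `sections` field collected above holds only top-level headings, because each section object carries its own nested `sections` list. A minimal sketch of flattening that tree follows. `FakeSection` is a hypothetical stand-in with the same two attributes (`title` and `sections`) so the example runs without a network call, and the outline is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSection:
    """Stand-in for a section object: only `title` and `sections` are used."""
    title: str
    sections: list = field(default_factory=list)


def flatten_sections(sections, depth: int = 0) -> list[tuple[int, str]]:
    """Walk a section tree depth-first and return (depth, title) pairs."""
    flat = []
    for section in sections:
        flat.append((depth, section.title))
        flat.extend(flatten_sections(section.sections, depth + 1))
    return flat


# Hypothetical outline, shaped like what page.sections would hold
outline = [
    FakeSection("History", [FakeSection("Early work"), FakeSection("Modern era")]),
    FakeSection("Applications"),
]

for depth, title in flatten_sections(outline):
    print("  " * depth + title)
```

With a real page, passing `page.sections` instead of `outline` should give the full heading hierarchy.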

Method 2: The MediaWiki REST API (Most Powerful)

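The MediaWiki API is Wikipedia's official programmatic interface. The code below uses two parts of it: the Action API at /w/api.php for search, links, and images, and the REST endpoint at /api/rest_v1/ for page summaries. Together they give far more control than the wikipedia-api library: you can run full-text search, read page metadata, fetch revision history, and look up each page's Wikidata ID.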

import httpx
import asyncio
import pandas as pd

WIKI_API = "https://en.wikipedia.org/w/api.php"
HEADERS  = {
    "User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"
}

async def wiki_api_request(params: dict) -> dict:
    """Make a request to the MediaWiki API."""
    params["format"] = "json"
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(WIKI_API, params=params, timeout=15)
        r.raise_for_status()
    return r.json()


async def search_wikipedia(query: str, limit: int = 10) -> list[dict]:
    """Full-text search across Wikipedia."""
    data = await wiki_api_request({
        "action":   "query",
        "list":     "search",
        "srsearch": query,
        "srlimit":  limit,
        "srprop":   "snippet|titlesnippet|size|wordcount",
    })

    import re  # local import, hoisted out of the loop

    results = []
    for item in data.get("query", {}).get("search", []):
        # Strip HTML tags from snippet
        snippet = re.sub(r"<[^>]+>", "", item.get("snippet", ""))
        results.append({
            "title":      item["title"],
            "snippet":    snippet,
            "word_count": item.get("wordcount", 0),
            "size_bytes": item.get("size", 0),
            "url":        f"https://en.wikipedia.org/wiki/{item['title'].replace(' ', '_')}",
        })

    return results


async def get_page_summary(title: str) -> dict:
    """Get a concise summary using Wikipedia's REST summary endpoint."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title.replace(' ', '_')}"
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(url, timeout=10)
        if r.status_code == 404:
            return {}
        r.raise_for_status()   # surface rate limiting and server errors
        data = r.json()

    return {
        "title":       data.get("title"),
        "description": data.get("description"),
        "extract":     data.get("extract"),        # Plain text summary
        "thumbnail":   data.get("thumbnail", {}).get("source"),
        "url":         data.get("content_urls", {}).get("desktop", {}).get("page"),
        "wikidata_id": data.get("wikibase_item"),  # For Wikidata queries
    }


async def get_page_links(title: str, limit: int = 100) -> list[str]:
    """Get up to `limit` internal links from a Wikipedia page (single request, no continuation)."""
    data = await wiki_api_request({
        "action":    "query",
        "titles":    title,
        "prop":      "links",
        "pllimit":   limit,
        "plnamespace": 0,   # Only article namespace (not talk pages etc.)
    })

    pages  = data.get("query", {}).get("pages", {})
    links  = []
    for page_data in pages.values():
        for link in page_data.get("links", []):
            links.append(link["title"])

    return links


async def get_page_images(title: str) -> list[dict]:
    """Get up to 50 image files used in a Wikipedia article."""
    data = await wiki_api_request({
        "action":  "query",
        "titles":  title,
        "prop":    "images",
        "imlimit": 50,
    })

    pages  = data.get("query", {}).get("pages", {})
    images = []
    for page_data in pages.values():
        for img in page_data.get("images", []):
            name = img["title"].replace("File:", "")
            if any(name.lower().endswith(ext) for ext in [".jpg", ".png", ".svg", ".gif"]):
                images.append({
                    "filename": name,
                    "wiki_url": f"https://commons.wikimedia.org/wiki/File:{name.replace(' ', '_')}",
                })

    return images


async def main_api():
    # Search
    print("── Search results for 'python web scraping' ──")
    results = await search_wikipedia("python web scraping", limit=5)
    for r in results:
        print(f"  {r['title']}: {r['snippet'][:80]}...")

    # Summary
    print("\n── Page summary ──")
    summary = await get_page_summary("Web scraping")
    print(f"  {summary['title']}: {summary['extract'][:200]}...")

    # Links
    print("\n── Top links from 'Web scraping' article ──")
    links = await get_page_links("Web scraping", limit=20)
    print(f"  {', '.join(links[:10])}")

asyncio.run(main_api())

Method 3: Extracting Wikipedia Infoboxes

Infoboxes are the structured data panels on the right side of Wikipedia articles — they contain the most machine-readable data on Wikipedia: population, area, GDP, founding date, CEO, headquarters, etc.

BeautifulSoup can locate the infobox table and walk its rows, turning the panel into key-value pairs for NLP training, trend analysis, or data journalism. The code below also needs httpx, beautifulsoup4, and lxml installed.

import re
import time

import httpx
import pandas as pd
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"}

def fetch_wikipedia_html(title: str) -> str:
    """Fetch the raw HTML of a Wikipedia page."""
    url = f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}"
    r   = httpx.get(url, headers=HEADERS, follow_redirects=True, timeout=15)
    r.raise_for_status()
    return r.text


def parse_infobox(html: str) -> dict:
    """
    Extract key-value pairs from a Wikipedia infobox.
    Works with most infobox types: country, person, company, city, film, etc.
    """
    soup    = BeautifulSoup(html, "lxml")
    infobox = soup.select_one("table.infobox")

    if not infobox:
        return {}

    data = {}
    for row in infobox.select("tr"):
        # Standard two-column layout: th = label, td = value
        label_el = row.select_one("th")
        value_el  = row.select_one("td")

        if label_el and value_el:
            label = label_el.get_text(separator=" ", strip=True)
            # Strip footnote numbers like [1], [2]
            value = re.sub(r"\[\d+\]", "", value_el.get_text(separator=" ", strip=True))
            value = re.sub(r"\s+", " ", value).strip()

            if label and value and len(label) < 80:
                data[label] = value

    return data


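def parse_all_tables(html: str) -> list[pd.DataFrame]:
    """
    Extract every HTML table on a Wikipedia page as a DataFrame
    (wikitables and the infobox alike).
    Great for statistical tables, comparison tables, rankings.
    """
    from io import StringIO

    try:
        # Newer pandas versions expect a file-like object rather than a raw HTML string
        return pd.read_html(StringIO(html), flavor="lxml")
    except ValueError:
        # read_html raises ValueError when the page contains no tables
        return []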


# Example: Extract country data from Wikipedia infoboxes
countries = ["India", "China", "United_States", "Brazil", "Germany"]

country_data = []
for country in countries:
    print(f"Fetching: {country}...")
    html     = fetch_wikipedia_html(country)
    infobox  = parse_infobox(html)
    tables   = parse_all_tables(html)

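    # Standardise common infobox fields. Labels vary between articles, and grouped
    # rows (such as population) may lack a direct value, so expect some None entries.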
    country_data.append({
        "country":      country,
        "capital":      infobox.get("Capital") or infobox.get("Capital city"),
        "population":   infobox.get("Population"),
        "area":         infobox.get("Area"),
        "gdp_nominal":  infobox.get("GDP (nominal)") or infobox.get("GDP (PPP)"),
        "currency":     infobox.get("Currency"),
        "language":     infobox.get("Official languages"),
        "government":   infobox.get("Government"),
        "tables_found": len(tables),
    })
    time.sleep(0.5)

df = pd.DataFrame(country_data)
print(df.to_string(index=False))
df.to_csv("country_infoboxes.csv", index=False)

Extracting specific tables (rankings, statistics)

def get_wikipedia_table(title: str, table_index: int = 0, match_text: str | None = None) -> pd.DataFrame | None:
    """
    Extract a specific table from a Wikipedia page.

    Args:
        title: Wikipedia article title
        table_index: Which table to return (0 = first)
        match_text: Return the first table whose header contains this text
    """
    html    = fetch_wikipedia_html(title)
    tables  = parse_all_tables(html)

    if not tables:
        print(f"No tables found on '{title}'")
        return None

    if match_text:
        for t in tables:
            # Check if any column header contains match_text
            if any(match_text.lower() in str(col).lower() for col in t.columns):
                return t
        print(f"No table with '{match_text}' found")
        return None

    if table_index >= len(tables):
        print(f"Table index {table_index} out of range (found {len(tables)} tables)")
        return None

    return tables[table_index]


# Get the list of largest cities by population
cities_df = get_wikipedia_table(
    "List of largest cities",
    match_text="Population"
)
if cities_df is not None:
    print(f"Found table with {len(cities_df)} rows")
    print(cities_df.head(10))
    cities_df.to_csv("largest_cities.csv", index=False)

# Get Nobel Prize winners table
nobel_df = get_wikipedia_table("List of Nobel laureates", table_index=0)
if nobel_df is not None:
    print(f"\nNobel laureates table: {len(nobel_df)} entries")
    print(nobel_df.head(5))
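
Column headers scraped from Wikipedia tables often carry footnote markers and punctuation, which makes them awkward to use as DataFrame column names. A small sketch of one way to normalise them follows. The function name `clean_header` and the sample headers are illustrative assumptions.

```python
import re


def clean_header(header: str) -> str:
    """Turn a scraped column header into a snake_case identifier."""
    text = re.sub(r"\[[^\]]*\]", "", header)        # drop footnote markers like [3] or [a]
    text = re.sub(r"[^0-9a-zA-Z]+", "_", text)      # collapse punctuation and spaces
    return text.strip("_").lower()


# Illustrative headers in the style of a scraped ranking table
raw_headers = ["City[a]", "Population (2023)[3]", "Area (km2)", "Country"]
print([clean_header(h) for h in raw_headers])
```

For a table returned by `get_wikipedia_table`, the same function can be applied with `df.columns = [clean_header(str(c)) for c in df.columns]`.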

Method 4: Building a Wikipedia Knowledge Graph Scraper

Wikipedia's internal links create a natural knowledge graph. Following links allows you to scrape entire topic clusters:

import asyncio
import json
from collections import deque

import httpx
import pandas as pd

HEADERS = {"User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)"}

async def build_knowledge_graph(
    seed_topic: str,
    max_articles: int = 50,
    max_depth: int = 2,
    language: str = "en"
) -> dict:
    """
    Build a knowledge graph by crawling Wikipedia starting from a seed topic.
    Follows internal links up to max_depth levels.

    Returns: {
        "nodes": [{title, summary, url, depth}],
        "edges": [(source_title, target_title)]
    }
    """
    nodes  = {}
    edges  = []
    queue  = deque([(seed_topic, 0)])
    visited = set()

    async with httpx.AsyncClient(headers=HEADERS) as client:
        while queue and len(nodes) < max_articles:
            title, depth = queue.popleft()

            if title in visited or depth > max_depth:
                continue
            visited.add(title)

            print(f"  [depth={depth}] {title} ({len(nodes)}/{max_articles})")

            # Fetch summary
            try:
                r = await client.get(
                    f"https://{language}.wikipedia.org/api/rest_v1/page/summary/"
                    f"{title.replace(' ', '_')}",
                    timeout=10
                )
                if r.status_code != 200:
                    continue
                data = r.json()
            except Exception:
                continue

            nodes[title] = {
                "title":       data.get("title", title),
                "description": data.get("description", ""),
                "summary":     (data.get("extract") or "")[:300],
                "url":         data.get("content_urls", {})
                                   .get("desktop", {}).get("page", ""),
                "depth":       depth,
            }

            # Fetch links if not at max depth
            if depth < max_depth:
                links_r = await client.get(
                    f"https://{language}.wikipedia.org/w/api.php",
                    params={
                        "action": "query", "titles": title,
                        "prop": "links", "pllimit": 20,
                        "plnamespace": 0, "format": "json"
                    },
                    timeout=10
                )
                links_data = links_r.json()
                pages      = links_data.get("query", {}).get("pages", {})

                for page_data in pages.values():
                    for link in page_data.get("links", []):
                        link_title = link["title"]
                        edges.append((title, link_title))
                        if link_title not in visited:
                            queue.append((link_title, depth + 1))

            await asyncio.sleep(0.3)   # Polite delay

    return {
        "seed":   seed_topic,
        "nodes":  list(nodes.values()),
        "edges":  edges[:500],   # Limit edges for manageability
        "stats":  {
            "total_articles": len(nodes),
            "total_edges":    len(edges),
            "max_depth":      max_depth,
        }
    }


async def main_graph():
    print("Building knowledge graph for 'Machine Learning'...")
    graph = await build_knowledge_graph(
        seed_topic="Machine learning",
        max_articles=30,
        max_depth=2
    )

    print(f"\nGraph built:")
    print(f"  Articles: {graph['stats']['total_articles']}")
    print(f"  Connections: {graph['stats']['total_edges']}")

    # Save as JSON for use in graph tools (NetworkX, Gephi, D3.js)
    with open("knowledge_graph.json", "w", encoding="utf-8") as f:
        json.dump(graph, f, indent=2, ensure_ascii=False)

    # Save nodes as CSV for analysis
    nodes_df = pd.DataFrame(graph["nodes"])
    nodes_df.to_csv("knowledge_graph_nodes.csv", index=False)
    print(f"\nSaved to knowledge_graph.json and knowledge_graph_nodes.csv")

asyncio.run(main_graph())

Part 2: Wikidata — Structured Data from Wikipedia

Wikidata is the structured data backbone behind Wikipedia: a large knowledge base of facts in machine-readable form. Nearly every Wikipedia article links to a Wikidata entity with typed properties. The examples below query it with SPARQL through the public query service.

import httpx
import asyncio
import pandas as pd

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
HEADERS = {
    "User-Agent": "ResearchBot/1.0 (contact@yourdomain.com)",
    "Accept": "application/sparql-results+json"
}

async def wikidata_query(sparql: str) -> list[dict]:
    """
    Execute a SPARQL query against Wikidata.
    Returns a list of result dicts.
    """
    async with httpx.AsyncClient(headers=HEADERS) as client:
        r = await client.get(
            WIKIDATA_SPARQL,
            params={"query": sparql, "format": "json"},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()

    bindings = data.get("results", {}).get("bindings", [])
    results  = []
    for row in bindings:
        results.append({
            key: val.get("value") for key, val in row.items()
        })
    return results


async def get_country_data() -> pd.DataFrame:
    """Get population, GDP, and capital for the 50 most populous sovereign states on Wikidata."""
    sparql = """
    SELECT ?country ?countryLabel ?population ?gdp ?capital ?capitalLabel WHERE {
        ?country wdt:P31 wd:Q3624078.      # Instance of: sovereign state
        OPTIONAL { ?country wdt:P1082 ?population. }
        OPTIONAL { ?country wdt:P2131 ?gdp. }
        OPTIONAL { ?country wdt:P36 ?capital. }
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en".
        }
    }
    ORDER BY DESC(?population)
    LIMIT 50
    """
    results = await wikidata_query(sparql)
    df = pd.DataFrame(results)
    return df


async def get_tech_companies() -> pd.DataFrame:
    """Get major tech companies with founding year and HQ from Wikidata."""
    sparql = """
    SELECT ?company ?companyLabel ?founded ?hq ?hqLabel ?employees WHERE {
        ?company wdt:P31 wd:Q4830453.      # Business enterprise
        ?company wdt:P452 wd:Q11032.       # Industry filter; confirm this QID on wikidata.org before relying on it
        OPTIONAL { ?company wdt:P571 ?founded. }
        OPTIONAL { ?company wdt:P159 ?hq. }
        OPTIONAL { ?company wdt:P1128 ?employees. }
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en".
        }
        FILTER(BOUND(?employees) && ?employees > 1000)
    }
    ORDER BY DESC(?employees)
    LIMIT 30
    """
    results = await wikidata_query(sparql)
    return pd.DataFrame(results)


async def main_wikidata():
    print("── Top 20 countries by population (Wikidata) ──")
    countries = await get_country_data()
    print(countries[["countryLabel", "population", "gdp"]].head(20))
    countries.to_csv("wikidata_countries.csv", index=False)

    print("\n── Major tech companies (Wikidata) ──")
    companies = await get_tech_companies()
    print(companies[["companyLabel", "founded", "hqLabel", "employees"]].head(15))
    companies.to_csv("wikidata_tech_companies.csv", index=False)

asyncio.run(main_wikidata())
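
`wikidata_query` keeps only the `value` of each result cell, so numbers come back as strings and entities as full URIs. In the SPARQL JSON results layout, each cell also carries a `type` and, for typed literals, a `datatype`. A minimal sketch of using those fields to coerce values follows. The function names and the sample row are assumptions, and the population figure is a placeholder rather than real data.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def coerce_binding(cell: dict):
    """Convert one SPARQL JSON result cell into a Python value."""
    value = cell.get("value", "")
    datatype = cell.get("datatype", "")

    if cell.get("type") == "uri" and value.startswith(ENTITY_PREFIX):
        return value[len(ENTITY_PREFIX):]            # keep just the entity ID, e.g. "Q668"
    if datatype == XSD + "integer":
        return int(value)
    if datatype in (XSD + "decimal", XSD + "double"):
        return float(value)
    return value                                      # leave plain strings and dates as text


def coerce_row(row: dict) -> dict:
    """Apply coerce_binding to every cell of one result row."""
    return {key: coerce_binding(cell) for key, cell in row.items()}


# Hand-written row in the SPARQL JSON results layout (population is a placeholder)
row = {
    "country": {"type": "uri", "value": "http://www.wikidata.org/entity/Q668"},
    "countryLabel": {"type": "literal", "xml:lang": "en", "value": "India"},
    "population": {"type": "literal", "datatype": XSD + "decimal", "value": "1000000"},
}

print(coerce_row(row))
```

Swapping this in for the dict comprehension inside `wikidata_query` would give numeric columns that sort and aggregate correctly in pandas.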

Part 3: India's Open Government Data — data.gov.in

The datagovindia library wraps the 100,000+ APIs on data.gov.in, the Government of India's open data platform. It covers three tasks: API discovery (finding the right API among all those available), API information (getting details about a particular API), and querying (pulling a tidy dataset from the chosen API). Method names have changed between releases of the library, so check the calls below against the documentation for the version you install.

This is an extraordinary resource for Indian developers and researchers — census data, agricultural statistics, health data, economic indicators, transport data — all free, all official, all available via a consistent API.

pip install datagovindia pandas

# Get your free API key at: data.gov.in/user/register
from datagovindia import DataGovIndia
import pandas as pd

# Initialise with your API key
dgi = DataGovIndia(api_key="YOUR_DATA_GOV_IN_API_KEY")

# ── Discovery: Find relevant datasets ──────────────────────────
def search_datasets(keyword: str, limit: int = 10) -> pd.DataFrame:
    """Search for government datasets by keyword."""
    results = dgi.search_data(keyword, results=limit)
    return pd.DataFrame(results)

# Search for datasets
education_datasets = search_datasets("school enrollment india")
print("── Education datasets on data.gov.in ──")
print(education_datasets[["title", "org_type", "source"]].head(10))

agriculture_datasets = search_datasets("crop production india state")
print("\n── Agriculture datasets ──")
print(agriculture_datasets[["title", "source"]].head(5))

# ── Get dataset info ──────────────────────────────────────────
def get_dataset_info(index_name: str) -> dict:
    """Get metadata about a specific dataset."""
    return dgi.get_data_info(index_name)

# ── Download a dataset ────────────────────────────────────────
def download_dataset(index_name: str, limit: int = 1000) -> pd.DataFrame:
    """Download records from a government dataset."""
    data = dgi.get_data(index_name, results=limit)
    return pd.DataFrame(data)


# Example datasets (index_names from data.gov.in search):

# State-wise school enrollment data
SCHOOL_DATASET = "your-dataset-index-name-from-search"

try:
    df = download_dataset(SCHOOL_DATASET)
    print(f"\nDownloaded {len(df)} records")
    print(df.head())
    df.to_csv("india_school_enrollment.csv", index=False)
except Exception as e:
    print(f"Dataset access error: {e}")

Direct API access (no wrapper library)

The Open Government Data Platform India also exposes its datasets directly over HTTP. Each dataset has a resource ID, and a GET request to https://api.data.gov.in/resource/RESOURCE-UUID with your API key returns its records as JSON, with limit and offset parameters for pagination.

import httpx
import asyncio
import pandas as pd

DATA_GOV_BASE = "https://api.data.gov.in/resource"
API_KEY       = "YOUR_DATA_GOV_IN_API_KEY"

async def fetch_gov_dataset(
    resource_id: str,
    limit: int = 100,
    offset: int = 0,
    filters: dict | None = None
) -> dict:
    """
    Fetch records from a data.gov.in dataset.

    Args:
        resource_id: Dataset UUID from data.gov.in (visible in URL)
        limit: Records per request (max 100)
        offset: Pagination offset
        filters: Dict of field:value filters, e.g. {"State": "Maharashtra"}
    """
    params = {
        "api-key": API_KEY,
        "format":  "json",
        "limit":   limit,
        "offset":  offset,
    }

    if filters:
        for field, value in filters.items():
            params[f"filters[{field}]"] = value

    url = f"{DATA_GOV_BASE}/{resource_id}"
    async with httpx.AsyncClient() as client:
        r = await client.get(url, params=params, timeout=20)
        r.raise_for_status()
    return r.json()


async def download_full_dataset(resource_id: str, max_records: int = 5000) -> pd.DataFrame:
    """
    Download an entire dataset by paginating through all records.
    """
    all_records = []
    offset      = 0
    page_size   = 100

    while len(all_records) < max_records:
        data = await fetch_gov_dataset(resource_id, limit=page_size, offset=offset)

        records = data.get("records", [])
        if not records:
            break

        all_records.extend(records)
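        total = int(data.get("total", 0) or 0)
        target = min(total, max_records) if total else max_records

        print(f"  Downloaded {len(all_records)}/{target} records...")

        # Stop at the reported total (when the API supplies one) or at the cap
        if (total and len(all_records) >= total) or len(all_records) >= max_records:
            break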

        offset += page_size
        await asyncio.sleep(0.5)   # Polite delay

    df = pd.DataFrame(all_records[:max_records])   # trim the final page to the cap
    print(f"  Total: {len(df)} records in {len(df.columns)} columns")
    return df


# Dataset example
# Resource IDs can be found on data.gov.in dataset pages
# Format: https://api.data.gov.in/resource/RESOURCE-UUID?api-key=KEY&format=json

async def main_datagov():
    # Placeholder resource ID: replace it with the ID of the census dataset you want
    CENSUS_RESOURCE_ID = "9ef84268-d588-465a-a308-a864a43d0070"

    print("Downloading India census data...")
    df = await download_full_dataset(CENSUS_RESOURCE_ID, max_records=500)

    if not df.empty:
        print(f"\nDataset shape: {df.shape}")
        print(f"Columns: {list(df.columns)}")
        print(df.head())
        df.to_csv("india_census_data.csv", index=False)

asyncio.run(main_datagov())

Part 4: US Federal Data — data.gov

The US government's data.gov platform hosts hundreds of thousands of datasets from every federal agency — health, transportation, energy, agriculture, economy, education. All free, all open.

import httpx
import asyncio
import pandas as pd

DATAGOV_API = "https://catalog.data.gov/api/3/action"

async def search_us_datasets(
    query: str,
    limit: int = 10,
    organization: str | None = None
) -> pd.DataFrame:
    """
    Search the US data.gov catalog.

    Args:
        query: Search terms
        limit: Number of results
        organization: Filter by agency, e.g. "census-bureau", "cdc-gov"
    """
    params = {
        "q":     query,
        "rows":  limit,
        "sort":  "score desc",
    }
    if organization:
        params["fq"] = f"organization:{organization}"

    async with httpx.AsyncClient() as client:
        r = await client.get(
            f"{DATAGOV_API}/package_search",
            params=params,
            timeout=15
        )
        r.raise_for_status()
        data = r.json()

    datasets = []
    for item in data.get("result", {}).get("results", []):
        datasets.append({
            "name":         item.get("name"),
            "title":        item.get("title"),
            # "organization" can be null for some datasets, so fall back to an empty dict
            "organization": (item.get("organization") or {}).get("title"),
            "description":  (item.get("notes") or "")[:200],
            "formats":      [res.get("format") for res in item.get("resources", [])],
            "url":          f"https://catalog.data.gov/dataset/{item.get('name')}",
            # Tracking count, if the catalog exposes one
            "downloads":    (item.get("tracking_summary") or {}).get("total", 0),
        })

    return pd.DataFrame(datasets)


async def download_datagov_csv(resource_url: str) -> pd.DataFrame:
    """Download a CSV dataset directly from data.gov."""
    async with httpx.AsyncClient(follow_redirects=True) as client:
        r = await client.get(resource_url, timeout=60)
        r.raise_for_status()

    import io
    df = pd.read_csv(io.StringIO(r.text))
    print(f"Downloaded: {df.shape[0]} rows × {df.shape[1]} columns")
    return df


async def get_dataset_resources(dataset_name: str) -> list[dict]:
    """Get all downloadable resources for a dataset."""
    async with httpx.AsyncClient() as client:
        r = await client.get(
            f"{DATAGOV_API}/package_show",
            params={"id": dataset_name},
            timeout=15
        )
        r.raise_for_status()
        data = r.json()

    resources = []
    for res in data.get("result", {}).get("resources", []):
        resources.append({
            "name":   res.get("name"),
            "format": res.get("format"),
            "url":    res.get("url"),
            "size":   res.get("size"),
        })
    return resources


async def main_datagov_us():
    # Search for health datasets
    print("── Searching data.gov for COVID datasets ──")
    health_df = await search_us_datasets("COVID vaccination rates state", limit=5)
    print(health_df[["title", "organization", "formats"]].to_string(index=False))

    # Search for economic datasets
    print("\n── Economic datasets from Census Bureau ──")
    econ_df = await search_us_datasets(
        "employment statistics",
        limit=5,
        organization="census-bureau"
    )
    print(econ_df[["title", "description"]].to_string(index=False))

asyncio.run(main_datagov_us())
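
A dataset on data.gov usually lists several resources in different formats, so the step between `get_dataset_resources` and `download_datagov_csv` is choosing the right one. A minimal sketch of that filter follows. The `package` dict is hand-written in the shape of a `package_show` result, and its names and URLs are placeholders.

```python
def pick_resources(package: dict, wanted_format: str = "CSV") -> list[dict]:
    """Return resources from a package dict whose format matches, case-insensitively."""
    wanted = wanted_format.strip().lower()
    return [
        {"name": res.get("name"), "url": res.get("url")}
        for res in package.get("resources", [])
        if (res.get("format") or "").strip().lower() == wanted
    ]


# Hand-written package in the shape of a package_show "result" (URLs are placeholders)
package = {
    "name": "example-dataset",
    "resources": [
        {"name": "Data (CSV)", "format": "CSV", "url": "https://example.org/data.csv"},
        {"name": "Data (JSON)", "format": "JSON", "url": "https://example.org/data.json"},
        {"name": "Legacy export", "format": "csv ", "url": "https://example.org/old.csv"},
        {"name": "Landing page", "format": None, "url": "https://example.org/"},
    ],
}

for res in pick_resources(package):
    print(res["name"], "->", res["url"])
```

Format labels are entered by each agency and are not always consistent, which is why the comparison strips whitespace and ignores case.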

Part 5: Combining Wikipedia + Government Data

The real power comes from joining multiple open sources. The example below builds the Wikipedia half of such a dataset, one row per Indian state keyed on the state name, ready to be merged with official statistics from data.gov.in:

import asyncio
import pandas as pd

# Reuses fetch_wikipedia_html() and parse_infobox() from Method 3

async def build_india_states_dataset() -> pd.DataFrame:
    """
    Build a dataset of Indian states from Wikipedia infoboxes
    (area, formation date, capital), keyed on state name so that
    data.gov.in statistics (population, literacy, GDP) can be merged in.
    """
    INDIAN_STATES = [
        "Maharashtra", "Uttar Pradesh", "Karnataka", "Tamil Nadu",
        "Gujarat", "Rajasthan", "West Bengal", "Andhra Pradesh",
        "Telangana", "Kerala", "Madhya Pradesh", "Bihar"
    ]

    # ── Wikipedia data ────────────────────────────────────────
    print("Fetching Wikipedia infobox data...")
    wiki_records = []
    for state in INDIAN_STATES:
        # Use the plain state name as the article title, matching wiki_url below
        html    = fetch_wikipedia_html(state.replace(" ", "_"))
        infobox = parse_infobox(html)
        wiki_records.append({
            "state":    state,
            "capital":  infobox.get("Capital") or infobox.get("Capital city"),
            "area_km2": infobox.get("Area"),
            "founded":  infobox.get("Formation") or infobox.get("Established"),
            "districts":infobox.get("Districts"),
            "wiki_url": f"https://en.wikipedia.org/wiki/{state.replace(' ', '_')}",
        })
        await asyncio.sleep(0.5)

    wiki_df = pd.DataFrame(wiki_records)

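    # ── Merge ─────────────────────────────────────────────────
    # Only Wikipedia data is collected above. Once you have a data.gov.in
    # DataFrame keyed by state, join it here with wiki_df.merge(..., on="state").
    final_df = wiki_df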
    final_df.to_csv("india_states_combined.csv", index=False)

    print(f"\nBuilt dataset: {len(final_df)} states × {len(final_df.columns)} columns")
    print(final_df[["state", "capital", "area_km2"]].to_string(index=False))
    return final_df

asyncio.run(build_india_states_dataset())
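
The merge step in the listing above is a pass-through, so here is one way the join could look. This is a sketch under assumptions: `gov_df` stands in for whatever data.gov.in returns, its column names (`state_name`, `population`) and figures are invented for illustration, and `normalise_state` and `parse_area_km2` are hypothetical helpers.

```python
import re
from typing import Optional

import pandas as pd


def normalise_state(name: str) -> str:
    """Lower-case, trim, and collapse whitespace so join keys line up."""
    return re.sub(r"\s+", " ", name).strip().lower()


def parse_area_km2(raw) -> Optional[float]:
    """Pull the first number out of a string like '307,713 km2 (118,809 sq mi)'."""
    if not isinstance(raw, str):
        return None  # Missing infobox value
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    return float(match.group().replace(",", "")) if match else None


# Placeholder rows shaped like wiki_records above (values are illustrative)
wiki_df = pd.DataFrame({
    "state":    ["Kerala", "Tamil Nadu", "Bihar"],
    "area_km2": ["38,863 km2 (15,005 sq mi)", "130,058 km2", None],
})

# Placeholder for a data.gov.in result; column names and figures are invented
gov_df = pd.DataFrame({
    "state_name": ["KERALA", "tamil  nadu"],
    "population": [100, 200],
})

wiki_df["key"] = wiki_df["state"].map(normalise_state)
gov_df["key"] = gov_df["state_name"].map(normalise_state)
wiki_df["area_km2"] = wiki_df["area_km2"].map(parse_area_km2)

# Left join keeps every Wikipedia row; indicator=True flags rows with no official match
merged = wiki_df.merge(
    gov_df.drop(columns="state_name"), on="key", how="left", indicator=True
).drop(columns="key")

print(merged[["state", "area_km2", "population", "_merge"]].to_string(index=False))
```

A left join with `indicator=True` makes spelling mismatches visible: any state marked `left_only` found no counterpart in the government data and needs its name reconciled.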

Wikipedia Database Dumps (For Bulk Access)

For extremely large-scale research (millions of articles), use Wikipedia's official database dumps instead of the API:

# Download the latest English Wikipedia article dump (~22GB compressed)
# Use dumps.wikimedia.org — updated monthly
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# For just abstracts and titles (~1GB) — much more practical
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz

The abstracts file can then be streamed in Python without loading it all into memory:

import gzip
import xml.etree.ElementTree as ET
import pandas as pd

def parse_wikipedia_abstracts(dump_file: str, max_articles: int = 10000) -> pd.DataFrame:
    """
    Parse the Wikipedia abstracts dump into a DataFrame.
    Much faster than API calls for large datasets.
    """
    records = []

    with gzip.open(dump_file, "rb") as f:
        for event, elem in ET.iterparse(f, events=["end"]):
            if len(records) >= max_articles:
                break      # Stop reading instead of scanning the rest of the dump
            if elem.tag != "doc":
                continue   # Only act on complete <doc> records

            title    = elem.findtext("title", "").replace("Wikipedia: ", "")
            url      = elem.findtext("url", "")
            abstract = elem.findtext("abstract", "")

            if title and abstract and len(abstract) > 50:
                records.append({
                    "title":    title,
                    "url":      url,
                    "abstract": abstract[:500],
                    "length":   len(abstract),
                })

            elem.clear()   # Free memory held by the processed record

    df = pd.DataFrame(records)
    print(f"Parsed {len(df):,} articles from dump")
    return df

# df = parse_wikipedia_abstracts("enwiki-latest-abstract.xml.gz", max_articles=100000)
# df.to_parquet("wikipedia_abstracts.parquet")  # Parquet for efficient storage (needs pyarrow or fastparquet)
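
To try the streaming pattern without downloading a dump, you can run it against a tiny gzip file built in memory. The XML below is a hand-written stand-in that only imitates the `<doc>` layout the parser above expects, and `iter_abstracts` is a hypothetical generator variant, not part of any library.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from typing import Iterator

# Hand-written sample imitating the <doc> records of the abstracts dump
SAMPLE_XML = b"""<feed>
<doc><title>Wikipedia: Alpha</title><url>https://en.wikipedia.org/wiki/Alpha</url><abstract>First sample abstract.</abstract></doc>
<doc><title>Wikipedia: Beta</title><url>https://en.wikipedia.org/wiki/Beta</url><abstract></abstract></doc>
<doc><title>Wikipedia: Gamma</title><url>https://en.wikipedia.org/wiki/Gamma</url><abstract>Third sample abstract.</abstract></doc>
</feed>"""


def iter_abstracts(fileobj, limit: int) -> Iterator[dict]:
    """Yield up to `limit` records that have a non-empty abstract."""
    if limit <= 0:
        return
    yielded = 0
    for _event, elem in ET.iterparse(fileobj, events=["end"]):
        if elem.tag != "doc":
            continue                   # Wait until a whole record has been read
        title = elem.findtext("title", "").replace("Wikipedia: ", "")
        url = elem.findtext("url", "")
        abstract = elem.findtext("abstract", "")
        elem.clear()                   # Release the processed record
        if title and abstract:
            yield {"title": title, "url": url, "abstract": abstract}
            yielded += 1
            if yielded >= limit:
                return                 # Stop early; later records are never parsed


compressed = gzip.compress(SAMPLE_XML)
with gzip.open(io.BytesIO(compressed), "rb") as f:
    records = list(iter_abstracts(f, limit=5))

print([r["title"] for r in records])
```

A generator lets the caller decide what to do with each record (write to disk, filter, batch into a DataFrame) while memory use stays flat regardless of dump size.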

FAQ

Q: Is scraping Wikipedia legal? Yes. Extracting publicly available data is permitted, provided your requests put minimal load on Wikipedia's servers, do not disrupt the site's operation, and respect its robots.txt. Wikipedia's content is published under Creative Commons Attribution-ShareAlike 4.0, which allows free reuse, including commercial applications, with attribution. Always use the API rather than raw scraping, and add a meaningful User-Agent.

Q: How do I find the right data.gov.in resource ID? Go to data.gov.in, search for your topic, click a dataset, and look at the API URL shown on the dataset page. The UUID in the URL is your resource ID.
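
If you have copied the API URL from a dataset page, the resource ID can be pulled out with a small regular expression. The URL below is a made-up example with a placeholder UUID and key, and its exact shape is an assumption. `extract_resource_id` is a hypothetical helper that relies only on the ID being a standard 8-4-4-4-12 hexadecimal UUID somewhere in the URL path.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Standard UUID layout: 8-4-4-4-12 hexadecimal characters
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def extract_resource_id(api_url: str) -> Optional[str]:
    """Return the first UUID found in the URL path, or None if there is none."""
    path = urlparse(api_url).path   # Ignore the query string (api-key, format, ...)
    match = UUID_RE.search(path)
    return match.group().lower() if match else None


# Placeholder URL: the UUID and key are not real
example_url = (
    "https://api.data.gov.in/resource/"
    "01234567-89ab-cdef-0123-456789abcdef"
    "?api-key=YOUR_API_KEY&format=json"
)
print(extract_resource_id(example_url))
```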

Q: What is Wikidata and how is it different from Wikipedia? Wikipedia contains articles (human-readable text). Wikidata contains structured facts (machine-readable data) — dates, quantities, relationships, identifiers. Every Wikipedia article links to a Wikidata entity. The SPARQL query interface lets you query across all of Wikidata at once.

Q: Can I use Wikipedia data for commercial AI training? Yes — Wikipedia's CC BY-SA 4.0 licence allows commercial use. You must attribute Wikipedia and share your derivative work under the same licence. Many major LLMs include Wikipedia in their training data.

Q: What's the politest way to scrape Wikipedia at scale? Use the official API. Set a descriptive User-Agent. Add 0.5–1 second delays between requests. For bulk access (millions of articles), use database dumps, which put no load on the live site or its API.
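
In code, that etiquette mostly comes down to two things: an identifying User-Agent and a minimum gap between requests. The sketch below shows one way to enforce both without tying it to a particular HTTP client. `Throttle` and `build_headers` are hypothetical helpers, and the project name and contact address are placeholders to replace with your own.

```python
import time


def build_headers(project: str, version: str, contact: str) -> dict:
    """Build a descriptive User-Agent so site operators can identify and reach you."""
    return {"User-Agent": f"{project}/{version} ({contact})"}


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval: float = 0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # Injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # Time of the previous request, if any

    def wait(self) -> float:
        """Sleep if needed and return how long we slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            self._sleep(delay)
        self._last = self._clock()
        return delay


headers = build_headers("MyResearchBot", "1.0", "contact@example.com")
throttle = Throttle(min_interval=0.5)

for title in ["Python_(programming_language)", "Pandas_(software)"]:
    slept = throttle.wait()
    # A real client call would go here, passing headers=headers
    print(f"{title}: waited {slept:.2f}s, User-Agent={headers['User-Agent']}")
```

This version blocks the thread, which suits simple scripts. In async code like the listings above, the same bookkeeping applies with `await asyncio.sleep(delay)` in place of the blocking sleep.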


Summary

| Source | What you get | Method | Licence |
|---|---|---|---|
| Wikipedia API | Article text, summaries, links | wikipedia-api + MediaWiki API | CC BY-SA 4.0 |
| Wikipedia HTML | Infoboxes, tables, images | BeautifulSoup | CC BY-SA 4.0 |
| Wikidata SPARQL | Structured facts, relationships | SPARQL query | CC0 (public domain) |
| Wikipedia dumps | Millions of articles | XML parsing | CC BY-SA 4.0 |
| data.gov.in | 100,000+ India govt datasets | datagovindia + REST API | NDSAP (open, free) |
| data.gov (US) | 300,000+ US federal datasets | REST API | Mostly public domain |
