Scraping Real Estate & Job Data with Python: Zillow, Indeed & More (2026)

Real estate and job data power some of the most profitable analytics businesses online. This guide shows how to scrape Zillow, Redfin, Indeed, and Glassdoor, extract listings and salaries, store everything in SQLite, and automate market monitoring with scheduled crawls.

The Two Most Valuable Data Sets on the Web

Two types of data power some of the most valuable business decisions made every day:

Real estate data — property prices, rental rates, square footage, school ratings, neighbourhood trends. Investors use it to identify undervalued markets. Renters use it to compare neighbourhoods. Data scientists use it to build valuation models.

Job market data — which skills are in demand, which companies are hiring, what salaries look like by role and city, how fast the job market is moving. Career switchers, recruiters, compensation analysts, and economists all need this data.

Both are publicly available. Both are on websites with aggressive anti-bot protection. And neither offers a free, comprehensive API.

This guide shows you how to collect both — responsibly, technically correctly, and with production-quality pipelines that actually run without breaking every week.

Part 1: Scraping Real Estate Data

Understanding Your Options

There are several options for real estate data, from scraping listing sites to free government downloads to paid APIs:

Source	Method	Pros	Cons
Zillow	Scraping	Rich data, huge coverage	Heavy anti-bot, JS-rendered
Redfin	Scraping / CSV export	Cleaner HTML, allows CSV download	US only
Realtor.com	Scraping	Good SERP-level data	Moderate anti-bot
Government sources	Direct download	Free, accurate, legal	Outdated, less detail
ATTOM Data / CoreLogic	Paid API	Comprehensive, legal, reliable	Expensive

For learning and personal projects, scraping Zillow and Redfin is the most instructive. For commercial applications, use ATTOM Data or a licensed provider.

Scraping Zillow Property Listings with Playwright

Zillow is entirely JavaScript-rendered. BeautifulSoup sees almost nothing useful. You need a full browser — Playwright is the right tool.

pip install playwright playwright-stealth pandas
playwright install chromium

import asyncio
import json
import random
import time
import pandas as pd
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

# ── Utilities ─────────────────────────────────────────────────

async def human_delay(min_s=1.5, max_s=4.0):
    await asyncio.sleep(random.uniform(min_s, max_s))

async def slow_scroll(page, steps=5):
    """Scroll down gradually to trigger lazy-loaded property cards."""
    for _ in range(steps):
        scroll_amount = random.randint(300, 700)
        await page.evaluate(f"window.scrollBy(0, {scroll_amount})")
        await asyncio.sleep(random.uniform(0.4, 1.0))

# ── Main Scraper ────────────────────────────────────────────────

async def scrape_zillow_listings(
    search_query: str,
    max_pages: int = 3
) -> list[dict]:
    """
    Scrape property listings from Zillow search results.

    Args:
        search_query: City/zip/neighbourhood, e.g. "Austin TX" or "Mumbai"
        max_pages: Number of result pages to scrape

    Returns:
        List of property dicts.
    """
    all_listings = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ]
        )

        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/Chicago",
        )

        page = await context.new_page()
        await stealth_async(page)

        # Block images and fonts to load faster
        await context.route(
            "**/*.{png,jpg,jpeg,gif,woff,woff2,ttf}",
            lambda route: route.abort()
        )

        # Encode search and build URL
        encoded = search_query.replace(" ", "-").lower()
        base_url = f"https://www.zillow.com/homes/{encoded}_rb/"

        for page_num in range(1, max_pages + 1):
            url = base_url if page_num == 1 else f"{base_url}{page_num}_p/"
            print(f"Scraping page {page_num}: {url}")

            await page.goto(url, wait_until="domcontentloaded")
            await human_delay(2, 4)
            await slow_scroll(page, steps=6)
            await human_delay(1, 2)

            # Extract from the embedded __NEXT_DATA__ JSON — more reliable
            # than CSS selectors which change frequently
            listings = await page.evaluate("""
                () => {
                    const scripts = document.querySelectorAll(
                        'script[type="application/json"]'
                    );
                    for (const script of scripts) {
                        try {
                            const data = JSON.parse(script.textContent);
                            // Walk the JSON tree to find search results
                            const results = data?.props?.pageProps
                                              ?.searchPageState?.cat1
                                              ?.searchResults?.listResults;
                            if (results) return results;
                        } catch {}
                    }
                    return [];
                }
            """)

            if not listings:
                # Fallback: extract from visible cards via CSS
                listings = await extract_zillow_cards_css(page)

            print(f"  Found {len(listings)} listings on page {page_num}")
            all_listings.extend(listings)

            # Check if there's a next page
            next_btn = await page.query_selector("a[title='Next page']")
            if not next_btn:
                print("  No more pages found.")
                break

            await human_delay(3, 6)

        await browser.close()

    return all_listings


async def extract_zillow_cards_css(page) -> list[dict]:
    """CSS-based fallback extractor for Zillow listing cards."""
    cards = await page.query_selector_all("article.list-card")
    results = []

    for card in cards:
        try:
            price_el   = await card.query_selector(".list-card-price")
            address_el = await card.query_selector("address")
            details_el = await card.query_selector(".list-card-details")
            link_el    = await card.query_selector("a.list-card-link")

            price   = await price_el.inner_text() if price_el else None
            address = await address_el.inner_text() if address_el else None
            details = await details_el.inner_text() if details_el else None
            href    = await link_el.get_attribute("href") if link_el else None

            if price or address:
                results.append({
                    "price":   price,
                    "address": address,
                    "details": details,
                    "url":     f"https://www.zillow.com{href}" if href else None,
                })
        except Exception:
            continue

    return results


def parse_zillow_json_listing(raw: dict) -> dict:
    """Parse a raw Zillow JSON listing into a clean dict."""
    return {
        "address":         raw.get("address"),
        "price":           raw.get("price"),
        "price_raw":       raw.get("unformattedPrice"),
        "beds":            raw.get("beds"),
        "baths":           raw.get("baths"),
        "sqft":            raw.get("area"),
        "property_type":   raw.get("hdpData", {}).get("homeInfo", {})
                              .get("homeType"),
        "days_on_market":  raw.get("hdpData", {}).get("homeInfo", {})
                              .get("daysOnZillow"),
        "zestimate":       raw.get("hdpData", {}).get("homeInfo", {})
                              .get("zestimate"),
        "latitude":        raw.get("latLong", {}).get("latitude"),
        "longitude":       raw.get("latLong", {}).get("longitude"),
        "listing_url":     raw.get("detailUrl"),
        "zpid":            raw.get("zpid"),
        "status":          raw.get("statusType"),
    }
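
The hard-coded path in the page.evaluate() call (props, pageProps, searchPageState, cat1, searchResults, listResults) stops working whenever the page state is reshuffled. A more tolerant option, sketched here on the assumption that the results still sit under a key named listResults somewhere in the parsed JSON, is to search the tree recursively. The helper name find_key and the sample payload are illustrative, not part of Zillow or any library.

```python
from typing import Any


def find_key(node: Any, target: str) -> Any:
    """Depth-first search of parsed JSON for the first value stored under `target`.

    Returns None when the key does not appear anywhere in the tree.
    """
    if isinstance(node, dict):
        if target in node:
            return node[target]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalars have no children to search

    for child in children:
        found = find_key(child, target)
        if found is not None:
            return found
    return None


# Toy payload shaped like the path used above (values are made up)
sample = {
    "props": {"pageProps": {"searchPageState": {"cat1": {
        "searchResults": {"listResults": [{"zpid": "1", "price": "$500,000"}]}
    }}}}
}
listings = find_key(sample, "listResults") or []
print(len(listings))
```

In the scraper you could return the whole parsed script body from page.evaluate() and run find_key on it in Python, keeping the explicit path only as a fast first attempt.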

Running the Real Estate Scraper

async def main_real_estate():
    raw_listings = await scrape_zillow_listings(
        search_query="Austin TX",
        max_pages=3
    )

    # Parse JSON listings if available, otherwise use raw CSS-extracted data
    clean = []
    for item in raw_listings:
        if isinstance(item, dict) and "zpid" in item:
            clean.append(parse_zillow_json_listing(item))
        elif isinstance(item, dict):
            clean.append(item)

    df = pd.DataFrame(clean)

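    # Clean price column. unformattedPrice is normally already numeric, so
    # convert it directly; stripping characters from its string form would
    # turn a float such as 450000.0 into 4500000.
    if "price_raw" in df.columns:
        df["price_usd"] = pd.to_numeric(df["price_raw"], errors="coerce")

    df.to_csv("zillow_listings.csv", index=False)
    print(f"\nSaved {len(df)} listings to zillow_listings.csv")
    # reindex() tolerates columns that CSS-fallback rows do not provide
    preview_cols = ["address", "price", "beds", "baths", "sqft"]
    print(df.reindex(columns=preview_cols).head(10))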

asyncio.run(main_real_estate())

Scraping Redfin (Easier Alternative to Zillow)

Redfin is significantly less aggressive with bot detection than Zillow, and even offers a hidden CSV download endpoint:

import asyncio
import io

import httpx
import pandas as pd

async def download_redfin_data(region_id: str, region_type: str = "6") -> pd.DataFrame:
    """
    Download property data directly from Redfin's CSV endpoint.

    To find your region_id:
    1. Search for a city on redfin.com
    2. Look at the URL: /city/30772/TX/Austin → region_id = 30772

    region_type: 2=zip, 4=neighbourhood, 6=city
    """
    url = (
        "https://www.redfin.com/stingray/api/gis-csv"
        f"?region_id={region_id}&region_type={region_type}"
        "&status=1&hoa=0&travel_with_traffic=false&uipt=1,2,3,4,5,6,7,13"
        "&sf=1,2,3,5,6,7&num_homes=350&v=8"
    )

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120",
        "Accept": "text/html,application/xhtml+xml,*/*",
        "Referer": "https://www.redfin.com/",
    }

    async with httpx.AsyncClient(headers=headers) as client:
        response = await client.get(url, timeout=30)
        response.raise_for_status()

    # Response is a CSV — parse directly into pandas
    df = pd.read_csv(io.StringIO(response.text))
    return df

# Austin, TX example
df = asyncio.run(download_redfin_data(region_id="30772", region_type="6"))
print(f"Downloaded {len(df)} listings from Redfin")
print(df.columns.tolist())
print(df[["ADDRESS", "PRICE", "BEDS", "BATHS", "SQUARE FEET"]].head())
df.to_csv("redfin_austin.csv", index=False)

This is clean, fast, and returns hundreds of listings in seconds — making Redfin the preferred source for real estate data when you don't need Zillow-specific features.

Part 2: Scraping Job Listings

Scraping Indeed Job Postings

Indeed is one of the largest job boards in the world. Its job search results are server-rendered HTML, which makes them easier to scrape than Zillow — but it still uses bot detection.

import httpx
import asyncio
import random
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlencode, urljoin

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Referer": "https://in.indeed.com/",
}

async def scrape_indeed_jobs(
    job_title: str,
    location: str,
    num_pages: int = 5,
    country: str = "in"   # Indeed subdomain: "in" for India, "www" for US, "uk" for UK
) -> list[dict]:
    """
    Scrape job listings from Indeed.

    Args:
        job_title: e.g. "Python Developer", "Data Scientist"
        location:  e.g. "Bangalore", "Mumbai", "Remote"
        num_pages: Pages to scrape (10 jobs per page)
        country:   Indeed subdomain: "in", "www" (US), "uk", "au"
    """
    base = f"https://{country}.indeed.com"
    all_jobs = []

    async with httpx.AsyncClient(headers=HEADERS, follow_redirects=True) as client:
        for page in range(num_pages):
            params = {
                "q":      job_title,
                "l":      location,
                "start":  page * 10,
                "sort":   "date",    # Most recent first
                "fromage": "14",     # Jobs from last 14 days
            }
            url = f"{base}/jobs?{urlencode(params)}"
            print(f"  Scraping page {page + 1}: {url}")

            try:
                r = await client.get(url, timeout=20)
                r.raise_for_status()
            except httpx.HTTPStatusError as e:
                print(f"  HTTP {e.response.status_code} on page {page+1}")
                break

            jobs = parse_indeed_page(r.text, base_url=base)
            all_jobs.extend(jobs)
            print(f"  Found {len(jobs)} jobs on page {page + 1}")

            if len(jobs) == 0:
                print("  No more jobs found. Stopping.")
                break

            await asyncio.sleep(random.uniform(2.0, 4.5))

    return all_jobs


def parse_indeed_page(html: str, base_url: str) -> list[dict]:
    """Parse job cards from an Indeed search results page."""
    soup  = BeautifulSoup(html, "lxml")
    cards = soup.select("div.job_seen_beacon")
    jobs  = []

    for card in cards:
        try:
            # Title and URL
            title_el = card.select_one("h2.jobTitle a")
            title    = title_el.get_text(strip=True) if title_el else None
            href     = title_el.get("href", "") if title_el else ""
            job_url  = urljoin(base_url, href)

            # Company
            company_el = card.select_one("[data-testid='company-name']")
            company    = company_el.get_text(strip=True) if company_el else None

            # Location
            location_el = card.select_one("[data-testid='text-location']")
            location    = location_el.get_text(strip=True) if location_el else None

            # Salary (often missing)
            salary_el = card.select_one("[data-testid='attribute_snippet_testid']")
            salary    = salary_el.get_text(strip=True) if salary_el else None

            # Snippet / job description preview
            snippet_el = card.select_one(".job-snippet")
            snippet    = snippet_el.get_text(separator=" ", strip=True) if snippet_el else None

            # Posted date
            date_el = card.select_one("[data-testid='myJobsStateDate']")
            posted  = date_el.get_text(strip=True) if date_el else None

            # Indeed job key (unique ID): check the card, then its ancestors
            job_key = card.get("data-jk")
            if not job_key:
                parent = card.find_parent(attrs={"data-jk": True})
                job_key = parent.get("data-jk") if parent else None

            if title and company:
                jobs.append({
                    "title":    title,
                    "company":  company,
                    "location": location,
                    "salary":   salary,
                    "snippet":  snippet,
                    "posted":   posted,
                    "url":      job_url,
                    "job_key":  job_key,
                })

        except Exception:
            continue

    return jobs


async def run_indeed_scraper():
    searches = [
        ("Python Developer", "Bangalore"),
        ("Data Scientist", "Mumbai"),
        ("Machine Learning Engineer", "Remote"),
    ]

    all_results = []
    for job_title, location in searches:
        print(f"\nSearching: '{job_title}' in '{location}'")
        jobs = await scrape_indeed_jobs(job_title, location, num_pages=3)
        for job in jobs:
            job["search_title"] = job_title
            job["search_location"] = location
        all_results.extend(jobs)

    df = pd.DataFrame(all_results).drop_duplicates(subset=["job_key"])
    df.to_csv("indeed_jobs.csv", index=False)
    print(f"\nTotal unique jobs collected: {len(df)}")
    print(df[["title", "company", "location", "salary"]].head(10))
    return df

asyncio.run(run_indeed_scraper())

Scraping Glassdoor for Salaries and Reviews

Glassdoor requires a login for most data, but salary ranges and job listings are partially accessible without authentication:

import asyncio
import random

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_glassdoor_jobs(
    job_title: str,
    location: str,
    num_pages: int = 3
) -> list[dict]:
    """
    Scrape Glassdoor job listings.
    Glassdoor is JS-heavy and requires Playwright.
    """
    all_jobs = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1280, "height": 800},
        )
        page = await context.new_page()
        await stealth_async(page)

        for page_num in range(1, num_pages + 1):
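            # Glassdoor URL structure. The IC<number> segment is a
            # location-specific ID that is hard-coded here, so it matches only
            # one city; copy the ID from a real Glassdoor search URL for your
            # own location.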
            title_slug = job_title.lower().replace(" ", "-")
            loc_slug   = location.lower().replace(" ", "-")
            url = (
                f"https://www.glassdoor.co.in/Job/"
                f"{loc_slug}-{title_slug}-jobs-SRCH_IL.0,{len(loc_slug)}"
                f"_IC1278971_KO{len(loc_slug)+1},{len(loc_slug)+len(title_slug)+1}"
                f"_IP{page_num}.htm"
            )

            print(f"Glassdoor page {page_num}: {url}")
            await page.goto(url, wait_until="domcontentloaded")
            await asyncio.sleep(random.uniform(2, 4))

            # Scroll to load lazy-loaded jobs
            for _ in range(4):
                await page.evaluate("window.scrollBy(0, 600)")
                await asyncio.sleep(0.8)

            # Extract job cards
            job_cards = await page.query_selector_all("li.react-job-listing")

            for card in job_cards:
                try:
                    title_el   = await card.query_selector("[data-test='job-title']")
                    company_el = await card.query_selector("[data-test='employer-name']")
                    location_el = await card.query_selector("[data-test='emp-location']")
                    salary_el  = await card.query_selector("[data-test='detailSalary']")
                    rating_el  = await card.query_selector("[data-test='rating']")

                    job = {
                        "title":    await title_el.inner_text() if title_el else None,
                        "company":  await company_el.inner_text() if company_el else None,
                        "location": await location_el.inner_text() if location_el else None,
                        "salary":   await salary_el.inner_text() if salary_el else None,
                        "rating":   await rating_el.inner_text() if rating_el else None,
                        "source":   "glassdoor",
                    }
                    if job["title"]:
                        all_jobs.append(job)
                except Exception:
                    continue

            print(f"  Collected {len(all_jobs)} total jobs so far")
            await asyncio.sleep(random.uniform(3, 6))

        await browser.close()

    return all_jobs

Part 3: The Combined Data Pipeline

Now let's build a unified pipeline that scrapes from multiple sources, deduplicates, enriches, and stores everything:

import asyncio
import random
import pandas as pd
import sqlalchemy as sa
from datetime import datetime, timezone

class JobMarketPipeline:
    """
    Unified pipeline for job market data from multiple sources.
    Stores to SQLite for lightweight, portable storage.
    """

    DB_URL = "sqlite:///job_market.db"

    def __init__(self):
        self.engine = sa.create_engine(self.DB_URL)
        self._ensure_schema()

    def _ensure_schema(self):
        with self.engine.connect() as conn:
            conn.execute(sa.text("""
                CREATE TABLE IF NOT EXISTS jobs (
                    id          INTEGER PRIMARY KEY AUTOINCREMENT,
                    title       TEXT,
                    company     TEXT,
                    location    TEXT,
                    salary_raw  TEXT,
                    salary_min  REAL,
                    salary_max  REAL,
                    snippet     TEXT,
                    source      TEXT,
                    search_query TEXT,
                    url         TEXT UNIQUE,
                    scraped_at  TEXT
                )
            """))
            conn.commit()

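    def _parse_salary(self, salary_str: str | None) -> tuple[float | None, float | None]:
        """Extract min/max salary from strings like '₹8L–₹15L/yr' or '$80k–$120k'.

        Unit suffixes are scaled: k = thousand, L = lakh (100,000).
        """
        if not salary_str:
            return None, None
        import re
        scale = {"": 1.0, "k": 1e3, "l": 1e5}
        # Each match is (number, optional unit suffix), e.g. ("80", "k")
        matches = re.findall(
            r"(\d+(?:\.\d+)?)(?:\s*(k|l)\b)?",
            salary_str.replace(",", ""),
            flags=re.IGNORECASE,
        )
        values = [float(num) * scale[unit.lower()] for num, unit in matches]
        if len(values) >= 2:
            return values[0], values[1]
        elif len(values) == 1:
            return values[0], None
        return None, None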

    def save_jobs(self, jobs: list[dict], search_query: str):
        """Save job listings to SQLite, skipping duplicates by URL."""
        records = []
        now = datetime.now(timezone.utc).isoformat()

        for job in jobs:
            salary_min, salary_max = self._parse_salary(job.get("salary"))
            records.append({
                "title":        job.get("title"),
                "company":      job.get("company"),
                "location":     job.get("location"),
                "salary_raw":   job.get("salary"),
                "salary_min":   salary_min,
                "salary_max":   salary_max,
                "snippet":      job.get("snippet"),
                "source":       job.get("source", "indeed"),
                "search_query": search_query,
                "url":          job.get("url"),
                "scraped_at":   now,
            })

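        df = pd.DataFrame(records)
        if df.empty:
            print("  No jobs to save.")
            return
        df = df.drop_duplicates(subset=["url"])
        # url is UNIQUE in the schema, so drop rows stored by earlier runs;
        # appending them again would raise an IntegrityError
        existing = pd.read_sql("SELECT url FROM jobs", self.engine)["url"]
        df = df[~df["url"].isin(existing)]
        df.to_sql("jobs", self.engine, if_exists="append", index=False,
                  method="multi")
        print(f"  Saved {len(df)} new jobs to database.")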

    def get_salary_analysis(self) -> pd.DataFrame:
        """Analyse salary distributions by job title."""
        query = """
            SELECT
                search_query,
                COUNT(*) as total_listings,
                COUNT(salary_min) as listings_with_salary,
                ROUND(AVG(salary_min), 0) as avg_min_salary,
                ROUND(AVG(salary_max), 0) as avg_max_salary,
                ROUND(MIN(salary_min), 0) as lowest_salary,
                ROUND(MAX(salary_max), 0) as highest_salary
            FROM jobs
            WHERE salary_min IS NOT NULL
            GROUP BY search_query
            ORDER BY avg_min_salary DESC
        """
        return pd.read_sql(query, self.engine)

    def get_top_hiring_companies(self, search_query: str | None = None) -> pd.DataFrame:
        """Find which companies are hiring the most."""
        # Bind the filter value as a parameter rather than formatting it into
        # the SQL string, which would allow SQL injection
        where = "WHERE search_query = :search_query" if search_query else ""
        params = {"search_query": search_query} if search_query else None
        query = sa.text(f"""
            SELECT
                company,
                COUNT(*) as open_positions,
                COUNT(DISTINCT location) as cities
            FROM jobs
            {where}
            GROUP BY company
            HAVING COUNT(*) >= 2
            ORDER BY open_positions DESC
            LIMIT 20
        """)
        return pd.read_sql(query, self.engine, params=params)

    def get_location_demand(self) -> pd.DataFrame:
        """Rank cities by number of job postings."""
        query = """
            SELECT
                location,
                COUNT(*) as job_count,
                COUNT(DISTINCT company) as unique_companies
            FROM jobs
            WHERE location IS NOT NULL
              AND location != ''
            GROUP BY location
            ORDER BY job_count DESC
            LIMIT 15
        """
        return pd.read_sql(query, self.engine)


async def run_full_pipeline():
    """
    Complete pipeline: scrape Indeed for multiple roles → store → analyse.
    """
    pipeline = JobMarketPipeline()

    searches = [
        ("Python Developer",           "Bangalore"),
        ("Data Scientist",             "Hyderabad"),
        ("Machine Learning Engineer",  "Remote"),
        ("Backend Engineer",           "Mumbai"),
        ("DevOps Engineer",            "Pune"),
    ]

    for job_title, location in searches:
        print(f"\n── Scraping: '{job_title}' in '{location}' ──")
        jobs = await scrape_indeed_jobs(job_title, location, num_pages=3)
        pipeline.save_jobs(jobs, search_query=job_title)
        await asyncio.sleep(random.uniform(2, 5))

    # ── Analysis ────────────────────────────────────────────────
    print("\n\n══ JOB MARKET ANALYSIS ══\n")

    print("── Salary Analysis by Role ──")
    salary_df = pipeline.get_salary_analysis()
    print(salary_df.to_string(index=False))

    print("\n── Top Hiring Companies ──")
    companies_df = pipeline.get_top_hiring_companies()
    print(companies_df.head(10).to_string(index=False))

    print("\n── Demand by Location ──")
    location_df = pipeline.get_location_demand()
    print(location_df.to_string(index=False))

    # Export final report
    all_jobs = pd.read_sql("SELECT * FROM jobs", pipeline.engine)
    all_jobs.to_csv("job_market_report.csv", index=False)
    print(f"\nFull dataset: {len(all_jobs)} jobs saved to job_market_report.csv")

asyncio.run(run_full_pipeline())
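
Because the jobs table declares url as UNIQUE, duplicates can also be skipped at the database level with SQLite's INSERT OR IGNORE. The sketch below uses the standard-library sqlite3 module and a cut-down version of the schema (three columns instead of the full set); the function name and the example.com URLs are placeholders.

```python
import sqlite3


def insert_new_jobs(conn: sqlite3.Connection, jobs: list[dict]) -> int:
    """Insert jobs, skipping any whose url is already stored. Returns rows added."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (title, company, url) "
        "VALUES (:title, :company, :url)",
        jobs,
    )
    conn.commit()
    # total_changes counts only rows that were actually written
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, company TEXT, url TEXT UNIQUE)"
)
batch = [
    {"title": "Python Developer", "company": "Acme", "url": "https://example.com/job/1"},
    {"title": "Data Scientist", "company": "Acme", "url": "https://example.com/job/2"},
]
first = insert_new_jobs(conn, batch)
second = insert_new_jobs(conn, batch)  # same URLs again: nothing new
print(first, second)
```

This keeps the duplicate check inside the database, so repeated weekly runs stay safe even if two scrapers write at the same time.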

Part 4: Real Estate Price Analysis

Once you have Redfin or Zillow data, here's how to turn it into useful insights:

import pandas as pd
import json

def analyse_real_estate(csv_file: str) -> tuple[dict, pd.DataFrame]:
    """
    Load and analyse a real estate dataset.
    Works with both Redfin CSV and processed Zillow data.
    """
    df = pd.read_csv(csv_file)

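    # Standardise column names
    col_map = {
        "PRICE":         "price",
        "BEDS":          "beds",
        "BATHS":         "baths",
        "SQUARE FEET":   "sqft",
        "ADDRESS":       "address",
        "CITY":          "city",
    }
    # Zillow output has both a formatted "price" and a numeric "price_raw";
    # keep only the numeric one so the rename cannot create duplicate columns
    if "price_raw" in df.columns:
        df = df.drop(columns=["price"], errors="ignore")
        col_map["price_raw"] = "price"
    df = df.rename(columns={k: v for k, v in col_map.items() if k in df.columns})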

    # Clean numeric columns
    for col in ["price", "sqft"]:
        if col in df.columns:
            df[col] = pd.to_numeric(
                df[col].astype(str).str.replace(r"[^\d.]", "", regex=True),
                errors="coerce"
            )

    # Price per sqft
    if "price" in df.columns and "sqft" in df.columns:
        df["price_per_sqft"] = (df["price"] / df["sqft"]).round(2)

    summary = {
        "total_listings":      len(df),
        "median_price":        df["price"].median() if "price" in df.columns else None,
        "avg_price":           df["price"].mean() if "price" in df.columns else None,
        "min_price":           df["price"].min() if "price" in df.columns else None,
        "max_price":           df["price"].max() if "price" in df.columns else None,
        "avg_price_per_sqft":  df["price_per_sqft"].mean() if "price_per_sqft" in df.columns else None,
        "avg_sqft":            df["sqft"].mean() if "sqft" in df.columns else None,
    }

    print("── Real Estate Summary ──")
    for k, v in summary.items():
        if v is not None:
            print(f"  {k:30s}: {v:,.0f}")

    # Price distribution by beds
    if "beds" in df.columns and "price" in df.columns:
        by_beds = df.groupby("beds")["price"].agg(["count", "median", "mean"])
        by_beds.columns = ["count", "median_price", "avg_price"]
        print("\n── Prices by Bedroom Count ──")
        print(by_beds.to_string())

    return summary, df


summary, df = analyse_real_estate("redfin_austin.csv")
df.to_csv("redfin_austin_clean.csv", index=False)

Scheduling Weekly Data Collection

To track market trends over time, run your scrapers on a weekly schedule:

import asyncio
import schedule
import time
from datetime import datetime

async def weekly_job_collection():
    """Run the full job scraping pipeline every Monday."""
    print(f"\n[{datetime.now().isoformat()}] Starting weekly job collection...")
    await run_full_pipeline()
    print("Weekly collection complete.")

def run_async_job():
    asyncio.run(weekly_job_collection())

# Schedule for every Monday at 6 AM
schedule.every().monday.at("06:00").do(run_async_job)

print("Scheduler started. Waiting for next run...")
while True:
    schedule.run_pending()
    time.sleep(60)

Or use a simple cron job on Linux/Mac:

# Edit crontab: crontab -e
# Run every Monday at 6am
0 6 * * 1 cd /path/to/project && python scraper.py >> logs/scraper.log 2>&1

Handling the Most Common Anti-Bot Blocks

Both Zillow and Indeed use anti-bot systems that evolve regularly. Here are the top issues and solutions:

Problem: Zillow shows "Sorry, we couldn't find that page"

Zillow detected the automation. Solutions:

Use headless=False to confirm the page renders correctly
Apply playwright-stealth patching to each new page
Increase delays between page loads to 5–10 seconds
Use a residential proxy with the proxy= parameter in new_context()

Problem: Indeed returns a CAPTCHA page

Your IP has been rate-limited. Solutions:

Slow down to 4–8 second delays between requests
Rotate User-Agent strings across requests
Use curl_cffi with impersonate="chrome120" instead of httpx
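
The first two suggestions can be sketched with the standard library alone. The User-Agent strings below are examples only and should be kept in line with current browser releases; the helper names are made up for this illustration.

```python
import random

# Example desktop User-Agent strings (illustrative; refresh them periodically)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def rotated_headers(base_headers: dict) -> dict:
    """Return a copy of base_headers with a randomly chosen User-Agent."""
    headers = dict(base_headers)
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers


def request_delay(min_s: float = 4.0, max_s: float = 8.0) -> float:
    """Pick a random pause, by default in the 4 to 8 second range."""
    return random.uniform(min_s, max_s)


base = {"Accept-Language": "en-US,en;q=0.9"}
headers = rotated_headers(base)
delay = request_delay()
# In scrape_indeed_jobs:
#   r = await client.get(url, headers=headers, timeout=20)
#   await asyncio.sleep(delay)
print(headers["User-Agent"][:40], round(delay, 1))
```

Passing headers per request overrides the client-level defaults, so the shared HEADERS dict can stay as the base.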

Problem: CSS selectors stopped finding elements

Both sites update their HTML structure regularly. Fix:

Open the site in DevTools, find the current element classes
Update your selectors accordingly
Consider targeting data-testid attributes — these are more stable than CSS classes

Problem: Empty results on Glassdoor

Glassdoor aggressively detects headless browsers. Solutions:

Run in headless=False mode during development to debug
Add more human-like scrolling and delays
Consider Glassdoor's own API for authorised developer access

FAQ

Q: Is it legal to scrape Zillow and Indeed? Both sites' Terms of Service prohibit automated scraping. For personal research and learning, the risk is minimal; neither typically pursues individuals. For commercial applications (building a competing product, reselling data), use licensed data providers: CoreLogic or ATTOM for real estate; LinkedIn Talent Insights or Lightcast (formerly Burning Glass) for jobs.

Q: How often should I run the scraper? For job market trend analysis: weekly is sufficient. For real-time job alerts: daily. For real estate market analysis: weekly or bi-weekly. Running more frequently increases ban risk without adding proportional value.

Q: What's the best free real estate data source? For US data, the Redfin CSV download endpoint (shown above) is the most accessible free source. For international data, government land registry APIs (UK Land Registry, Indian Registration Department portals) provide official transaction data.

Q: Can I scrape salary data from LinkedIn? LinkedIn shows salary ranges on some job postings. The scraping approach is similar to Blog 02 (LinkedIn scraping with Playwright) — use saved cookies, stealth mode, and conservative delays. For systematic salary research, LinkedIn Talent Insights is the official paid option.

Q: How do I avoid my IP being banned permanently?

Keep daily request volume low (under 200 requests per day per IP), use residential proxies for higher-volume work, respect robots.txt, and never attempt to log in automatically. If you get a temporary block, wait 24 hours before retrying, ideally from a different IP.
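
Two of these rules, the daily cap and the robots.txt check, are easy to enforce in code. The following is a minimal sketch using only the standard library. `DailyBudget` and `allowed_by_robots` are hypothetical helpers, and the default cap of 200 simply mirrors the figure above.

```python
from datetime import date
from urllib.robotparser import RobotFileParser


class DailyBudget:
    """Count requests per calendar day and refuse once the cap is reached."""

    def __init__(self, max_per_day=200):
        self.max_per_day = max_per_day
        self._day = None
        self._used = 0

    def try_spend(self, today=None):
        """Return True and consume one request if budget remains for today."""
        today = today or date.today()
        if today != self._day:
            # A new day has started: reset the counter.
            self._day, self._used = today, 0
        if self._used >= self.max_per_day:
            return False
        self._used += 1
        return True


def allowed_by_robots(robots_txt, url, agent="*"):
    """Check a URL against robots.txt rules that have already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

In a scraper loop, calling `try_spend()` before each request and stopping when it returns False keeps a run inside the cap even if the number of searches grows.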

Summary

Task	Tool	Key Notes
Zillow scraping	Playwright + stealth	JS-rendered, needs browser automation
Redfin data	httpx + CSV endpoint	Easier, returns structured CSV directly
Indeed job scraping	httpx + BeautifulSoup	Server-rendered, moderate anti-bot
Glassdoor job scraping	Playwright + stealth	JS-heavy, requires browser
Data storage	SQLite + pandas	Lightweight, portable, queryable
Salary analysis	SQL GROUP BY via pandas.read_sql	Parse min/max from raw salary strings
Scheduling	schedule / cron	Weekly collection for trend analysis

ZyVOP
