Scraping Real Estate & Job Data with Python: Zillow, Indeed & More (2026)
Learn how to scrape real estate listings from Zillow and job postings from Indeed and Glassdoor with Python in 2026. Full working code, anti-bot handling, and data pipeline included.
Senior Developer

The Two Most Valuable Data Sets on the Web
Two types of data power some of the most valuable business decisions made every day:
Real estate data — property prices, rental rates, square footage, school ratings, neighbourhood trends. Investors use it to identify undervalued markets. Renters use it to compare neighbourhoods. Data scientists use it to build valuation models.
Job market data — which skills are in demand, which companies are hiring, what salaries look like by role and city, how fast the job market is moving. Career switchers, recruiters, compensation analysts, and economists all need this data.
Both are publicly available. Both are on websites with aggressive anti-bot protection. And neither offers a free, comprehensive API.
This guide shows you how to collect both — responsibly, technically correctly, and with production-quality pipelines that actually run without breaking every week.
Part 1: Scraping Real Estate Data
Understanding Your Options
There are three tiers of approach for real estate data:
| Source | Method | Pros | Cons |
|---|---|---|---|
| Zillow | Scraping | Rich data, huge coverage | Heavy anti-bot, JS-rendered |
| Redfin | Scraping / CSV export | Cleaner HTML, allows CSV download | US only |
| Realtor.com | Scraping | Good SERP-level data | Moderate anti-bot |
| Government sources | Direct download | Free, accurate, legal | Outdated, less detail |
| ATTOM Data / CoreLogic | Paid API | Comprehensive, legal, reliable | Expensive |
For learning and personal projects, scraping Zillow and Redfin is the most instructive. For commercial applications, use ATTOM Data or a licensed provider.
Scraping Zillow Property Listings with Playwright
Zillow is entirely JavaScript-rendered. BeautifulSoup sees almost nothing useful. You need a full browser — Playwright is the right tool.
pip install playwright playwright-stealth pandas
playwright install chromium
import asyncio
import random
import pandas as pd
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

# ── Utilities ─────────────────────────────────────────────────
async def human_delay(min_s=1.5, max_s=4.0):
    """Sleep for a random interval so requests do not follow a fixed rhythm."""
    await asyncio.sleep(random.uniform(min_s, max_s))

async def slow_scroll(page, steps=5):
    """Scroll down gradually to trigger lazy-loaded property cards."""
    for _ in range(steps):
        scroll_amount = random.randint(300, 700)
        await page.evaluate(f"window.scrollBy(0, {scroll_amount})")
        await asyncio.sleep(random.uniform(0.4, 1.0))

# ── Main Scraper ────────────────────────────────────────────────
async def scrape_zillow_listings(
    search_query: str,
    max_pages: int = 3
) -> list[dict]:
    """
    Scrape property listings from Zillow search results.

    Args:
        search_query: City/zip/neighbourhood, e.g. "Austin TX" or "Seattle WA"
        max_pages: Number of result pages to scrape
    Returns:
        List of property dicts.
    """
    all_listings = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ]
        )
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/Chicago",
        )
        page = await context.new_page()
        await stealth_async(page)
        # Block images and fonts to load faster
        await context.route(
            "**/*.{png,jpg,jpeg,gif,woff,woff2,ttf}",
            lambda route: route.abort()
        )
        # Encode search and build URL
        encoded = search_query.replace(" ", "-").lower()
        base_url = f"https://www.zillow.com/homes/{encoded}_rb/"
        for page_num in range(1, max_pages + 1):
            url = base_url if page_num == 1 else f"{base_url}{page_num}_p/"
            print(f"Scraping page {page_num}: {url}")
            await page.goto(url, wait_until="domcontentloaded")
            await human_delay(2, 4)
            await slow_scroll(page, steps=6)
            await human_delay(1, 2)
            # Extract from the embedded __NEXT_DATA__ JSON, which is more
            # reliable than CSS selectors that change frequently
            listings = await page.evaluate("""
                () => {
                    const scripts = document.querySelectorAll(
                        'script[type="application/json"]'
                    );
                    for (const script of scripts) {
                        try {
                            const data = JSON.parse(script.textContent);
                            // Walk the JSON tree to find search results
                            const results = data?.props?.pageProps
                                ?.searchPageState?.cat1
                                ?.searchResults?.listResults;
                            if (results) return results;
                        } catch {}
                    }
                    return [];
                }
            """)
            if not listings:
                # Fallback: extract from visible cards via CSS
                listings = await extract_zillow_cards_css(page)
            print(f"  Found {len(listings)} listings on page {page_num}")
            all_listings.extend(listings)
            # Check if there's a next page
            next_btn = await page.query_selector("a[title='Next page']")
            if not next_btn:
                print("  No more pages found.")
                break
            await human_delay(3, 6)
        await browser.close()
    return all_listings

async def extract_zillow_cards_css(page) -> list[dict]:
    """CSS-based fallback extractor for Zillow listing cards."""
    cards = await page.query_selector_all("article.list-card")
    results = []
    for card in cards:
        try:
            price_el = await card.query_selector(".list-card-price")
            address_el = await card.query_selector("address")
            details_el = await card.query_selector(".list-card-details")
            link_el = await card.query_selector("a.list-card-link")
            price = await price_el.inner_text() if price_el else None
            address = await address_el.inner_text() if address_el else None
            details = await details_el.inner_text() if details_el else None
            href = await link_el.get_attribute("href") if link_el else None
            if price or address:
                results.append({
                    "price": price,
                    "address": address,
                    "details": details,
                    "url": f"https://www.zillow.com{href}" if href else None,
                })
        except Exception:
            # Skip cards that fail to parse rather than aborting the page
            continue
    return results

def parse_zillow_json_listing(raw: dict) -> dict:
    """Parse a raw Zillow JSON listing into a clean dict."""
    # hdpData and latLong can be missing or null, so fall back to empty dicts
    home_info = (raw.get("hdpData") or {}).get("homeInfo") or {}
    lat_long = raw.get("latLong") or {}
    return {
        "address": raw.get("address"),
        "price": raw.get("price"),
        "price_raw": raw.get("unformattedPrice"),
        "beds": raw.get("beds"),
        "baths": raw.get("baths"),
        "sqft": raw.get("area"),
        "property_type": home_info.get("homeType"),
        "days_on_market": home_info.get("daysOnZillow"),
        "zestimate": home_info.get("zestimate"),
        "latitude": lat_long.get("latitude"),
        "longitude": lat_long.get("longitude"),
        "listing_url": raw.get("detailUrl"),
        "zpid": raw.get("zpid"),
        "status": raw.get("statusType"),
    }

Running the Real Estate Scraper
async def main_real_estate():
    raw_listings = await scrape_zillow_listings(
        search_query="Austin TX",
        max_pages=3
    )
    # Parse JSON listings if available, otherwise use raw CSS-extracted data
    clean = []
    for item in raw_listings:
        if isinstance(item, dict) and "zpid" in item:
            clean.append(parse_zillow_json_listing(item))
        elif isinstance(item, dict):
            clean.append(item)
    df = pd.DataFrame(clean)
    # Clean price column. Keep the decimal point: stripping it would turn
    # a float such as 450000.0 into 4500000.
    if "price_raw" in df.columns:
        df["price_usd"] = pd.to_numeric(
            df["price_raw"].astype(str).str.replace(r"[^\d.]", "", regex=True),
            errors="coerce"
        )
    df.to_csv("zillow_listings.csv", index=False)
    print(f"\nSaved {len(df)} listings to zillow_listings.csv")
    # The CSS fallback yields fewer columns, so preview only those present
    preview_cols = [c for c in ["address", "price", "beds", "baths", "sqft"] if c in df.columns]
    print(df[preview_cols].head(10))

asyncio.run(main_real_estate())

Scraping Redfin (Easier Alternative to Zillow)
Redfin is significantly less aggressive with bot detection than Zillow, and even offers a hidden CSV download endpoint:
import asyncio
import io
import httpx
import pandas as pd

async def download_redfin_data(region_id: str, region_type: str = "6") -> pd.DataFrame:
    """
    Download property data directly from Redfin's CSV endpoint.

    To find your region_id:
    1. Search for a city on redfin.com
    2. Look at the URL: /city/30772/TX/Austin → region_id = 30772

    region_type: 2=zip, 4=neighbourhood, 6=city
    """
    url = (
        "https://www.redfin.com/stingray/api/gis-csv"
        f"?region_id={region_id}&region_type={region_type}"
        "&status=1&hoa=0&travel_with_traffic=false&uipt=1,2,3,4,5,6,7,13"
        "&sf=1,2,3,5,6,7&num_homes=350&v=8"
    )
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120",
        "Accept": "text/html,application/xhtml+xml,*/*",
        "Referer": "https://www.redfin.com/",
    }
    async with httpx.AsyncClient(headers=headers) as client:
        response = await client.get(url, timeout=30)
        response.raise_for_status()
    # The response body is CSV, so parse it directly into pandas
    df = pd.read_csv(io.StringIO(response.text))
    return df

# Austin, TX example
df = asyncio.run(download_redfin_data(region_id="30772", region_type="6"))
print(f"Downloaded {len(df)} listings from Redfin")
print(df.columns.tolist())
print(df[["ADDRESS", "PRICE", "BEDS", "BATHS", "SQUARE FEET"]].head())
df.to_csv("redfin_austin.csv", index=False)

This is clean, fast, and returns hundreds of listings in seconds, which makes Redfin the preferred source for real estate data when you don't need Zillow-specific features.
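
The query string in download_redfin_data is assembled by hand, which makes it easy to mistype or corrupt a parameter name. As a minimal sketch, the same URL could be built from a dict with urllib.parse.urlencode. The helper name build_redfin_csv_url is hypothetical, the parameter values are copied from the function above, and the endpoint itself is undocumented, so it may change without notice.

```python
from urllib.parse import urlencode

REDFIN_CSV_ENDPOINT = "https://www.redfin.com/stingray/api/gis-csv"


def build_redfin_csv_url(region_id: str, region_type: str = "6", num_homes: int = 350) -> str:
    """Build the CSV download URL from named parameters (hypothetical helper)."""
    params = {
        "region_id": region_id,
        "region_type": region_type,
        "status": 1,
        "hoa": 0,
        "travel_with_traffic": "false",
        "uipt": "1,2,3,4,5,6,7,13",
        "sf": "1,2,3,5,6,7",
        "num_homes": num_homes,
        "v": 8,
    }
    # safe="," keeps the comma-separated lists readable instead of %2C-encoding them
    return f"{REDFIN_CSV_ENDPOINT}?{urlencode(params, safe=',')}"


url = build_redfin_csv_url("30772")
print(url)
```

Keeping the parameters in a dict also makes it simpler to vary one of them, such as num_homes, per call.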
Part 2: Scraping Job Listings
Scraping Indeed Job Postings
Indeed is one of the largest job boards in the world. Its job search results are server-rendered HTML, which makes them easier to scrape than Zillow — but it still uses bot detection.
import asyncio
import random
import httpx
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlencode, urljoin

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Referer": "https://in.indeed.com/",
}

async def scrape_indeed_jobs(
    job_title: str,
    location: str,
    num_pages: int = 5,
    country: str = "in"  # Subdomain: "in" for India, "www" for US, "uk" for UK
) -> list[dict]:
    """
    Scrape job listings from Indeed.

    Args:
        job_title: e.g. "Python Developer", "Data Scientist"
        location: e.g. "Bangalore", "Mumbai", "Remote"
        num_pages: Pages to scrape (10 jobs per page)
        country: Indeed subdomain: "in", "www" (US), "uk", "au"
    """
    base = f"https://{country}.indeed.com"
    all_jobs = []
    async with httpx.AsyncClient(headers=HEADERS, follow_redirects=True) as client:
        for page in range(num_pages):
            params = {
                "q": job_title,
                "l": location,
                "start": page * 10,
                "sort": "date",     # Most recent first
                "fromage": "14",    # Jobs from last 14 days
            }
            url = f"{base}/jobs?{urlencode(params)}"
            print(f"  Scraping page {page + 1}: {url}")
            try:
                r = await client.get(url, timeout=20)
                r.raise_for_status()
            except httpx.HTTPStatusError as e:
                print(f"  HTTP {e.response.status_code} on page {page + 1}")
                break
            jobs = parse_indeed_page(r.text, base_url=base)
            all_jobs.extend(jobs)
            print(f"  Found {len(jobs)} jobs on page {page + 1}")
            if len(jobs) == 0:
                print("  No more jobs found. Stopping.")
                break
            await asyncio.sleep(random.uniform(2.0, 4.5))
    return all_jobs

def parse_indeed_page(html: str, base_url: str) -> list[dict]:
    """Parse job cards from an Indeed search results page."""
    soup = BeautifulSoup(html, "lxml")  # requires the lxml package
    cards = soup.select("div.job_seen_beacon")
    jobs = []
    for card in cards:
        try:
            # Title and URL
            title_el = card.select_one("h2.jobTitle a")
            title = title_el.get_text(strip=True) if title_el else None
            href = title_el.get("href", "") if title_el else ""
            job_url = urljoin(base_url, href)
            # Company
            company_el = card.select_one("[data-testid='company-name']")
            company = company_el.get_text(strip=True) if company_el else None
            # Location
            location_el = card.select_one("[data-testid='text-location']")
            location = location_el.get_text(strip=True) if location_el else None
            # Salary (often missing)
            salary_el = card.select_one("[data-testid='attribute_snippet_testid']")
            salary = salary_el.get_text(strip=True) if salary_el else None
            # Snippet / job description preview
            snippet_el = card.select_one(".job-snippet")
            snippet = snippet_el.get_text(separator=" ", strip=True) if snippet_el else None
            # Posted date
            date_el = card.select_one("[data-testid='myJobsStateDate']")
            posted = date_el.get_text(strip=True) if date_el else None
            # Indeed job key (unique ID): check the card itself, then its
            # descendants, then its ancestors for a data-jk attribute
            jk_holder = card.select_one("[data-jk]") or card.find_parent(
                attrs={"data-jk": True}
            )
            job_key = card.get("data-jk") or (jk_holder.get("data-jk") if jk_holder else None)
            if title and company:
                jobs.append({
                    "title": title,
                    "company": company,
                    "location": location,
                    "salary": salary,
                    "snippet": snippet,
                    "posted": posted,
                    "url": job_url,
                    "job_key": job_key,
                })
        except Exception:
            # Skip cards that fail to parse rather than aborting the page
            continue
    return jobs

async def run_indeed_scraper():
    searches = [
        ("Python Developer", "Bangalore"),
        ("Data Scientist", "Mumbai"),
        ("Machine Learning Engineer", "Remote"),
    ]
    all_results = []
    for job_title, location in searches:
        print(f"\nSearching: '{job_title}' in '{location}'")
        jobs = await scrape_indeed_jobs(job_title, location, num_pages=3)
        for job in jobs:
            job["search_title"] = job_title
            job["search_location"] = location
        all_results.extend(jobs)
    df = pd.DataFrame(all_results).drop_duplicates(subset=["job_key"])
    df.to_csv("indeed_jobs.csv", index=False)
    print(f"\nTotal unique jobs collected: {len(df)}")
    print(df[["title", "company", "location", "salary"]].head(10))
    return df

asyncio.run(run_indeed_scraper())

Scraping Glassdoor for Salaries and Reviews
Glassdoor requires a login for most data, but salary ranges and job listings are partially accessible without authentication:
import asyncio
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_glassdoor_jobs(
    job_title: str,
    location: str,
    num_pages: int = 3
) -> list[dict]:
    """
    Scrape Glassdoor job listings.
    Glassdoor is JS-heavy and requires Playwright.
    """
    all_jobs = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1280, "height": 800},
        )
        page = await context.new_page()
        await stealth_async(page)
        for page_num in range(1, num_pages + 1):
            # Glassdoor URL structure. The IC... segment is a location-specific
            # ID hard-coded here; copy the value for your city from a real
            # Glassdoor search URL.
            title_slug = job_title.lower().replace(" ", "-")
            loc_slug = location.lower().replace(" ", "-")
            url = (
                f"https://www.glassdoor.co.in/Job/"
                f"{loc_slug}-{title_slug}-jobs-SRCH_IL.0,{len(loc_slug)}"
                f"_IC1278971_KO{len(loc_slug)+1},{len(loc_slug)+len(title_slug)+1}"
                f"_IP{page_num}.htm"
            )
            print(f"Glassdoor page {page_num}: {url}")
            await page.goto(url, wait_until="domcontentloaded")
            await asyncio.sleep(random.uniform(2, 4))
            # Scroll to load lazy-loaded jobs
            for _ in range(4):
                await page.evaluate("window.scrollBy(0, 600)")
                await asyncio.sleep(0.8)
            # Extract job cards
            job_cards = await page.query_selector_all("li.react-job-listing")
            for card in job_cards:
                try:
                    title_el = await card.query_selector("[data-test='job-title']")
                    company_el = await card.query_selector("[data-test='employer-name']")
                    location_el = await card.query_selector("[data-test='emp-location']")
                    salary_el = await card.query_selector("[data-test='detailSalary']")
                    rating_el = await card.query_selector("[data-test='rating']")
                    job = {
                        "title": await title_el.inner_text() if title_el else None,
                        "company": await company_el.inner_text() if company_el else None,
                        "location": await location_el.inner_text() if location_el else None,
                        "salary": await salary_el.inner_text() if salary_el else None,
                        "rating": await rating_el.inner_text() if rating_el else None,
                        "source": "glassdoor",
                    }
                    if job["title"]:
                        all_jobs.append(job)
                except Exception:
                    continue
            print(f"  Collected {len(all_jobs)} total jobs so far")
            await asyncio.sleep(random.uniform(3, 6))
        await browser.close()
    return all_jobs

Part 3: The Combined Data Pipeline
Now let's build a unified pipeline that scrapes from multiple sources, deduplicates, enriches, and stores everything:
import asyncio
import random
import pandas as pd
import sqlalchemy as sa
from datetime import datetime, timezone

class JobMarketPipeline:
    """
    Unified pipeline for job market data from multiple sources.
    Stores to SQLite for lightweight, portable storage.
    """
    DB_URL = "sqlite:///job_market.db"

    def __init__(self):
        self.engine = sa.create_engine(self.DB_URL)
        self._ensure_schema()

    def _ensure_schema(self):
        with self.engine.connect() as conn:
            conn.execute(sa.text("""
                CREATE TABLE IF NOT EXISTS jobs (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    title TEXT,
                    company TEXT,
                    location TEXT,
                    salary_raw TEXT,
                    salary_min REAL,
                    salary_max REAL,
                    snippet TEXT,
                    source TEXT,
                    search_query TEXT,
                    url TEXT UNIQUE,
                    scraped_at TEXT
                )
            """))
            conn.commit()
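
    def _parse_salary(self, salary_str: str | None) -> tuple[float | None, float | None]:
        """Extract min/max salary from strings like '₹8L–₹15L/yr' or '$80k–$120k'."""
        if not salary_str:
            return None, None
        import re
        # Scale 'k' (thousand) and 'L' (lakh) suffixes so values are comparable.
        # Currencies are not converted, so compare within one currency only.
        multipliers = {"k": 1_000, "l": 100_000}
        matches = re.findall(r"(\d+(?:\.\d+)?)\s*([kKlL]?)\b", salary_str.replace(",", ""))
        amounts = [float(num) * multipliers.get(suffix.lower(), 1) for num, suffix in matches]
        if len(amounts) >= 2:
            return amounts[0], amounts[1]
        elif len(amounts) == 1:
            return amounts[0], None
        return None, None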

    def save_jobs(self, jobs: list[dict], search_query: str):
        """Save job listings to SQLite, skipping duplicates by URL."""
        if not jobs:
            print("  No jobs to save.")
            return
        records = []
        now = datetime.now(timezone.utc).isoformat()
        for job in jobs:
            salary_min, salary_max = self._parse_salary(job.get("salary"))
            records.append({
                "title": job.get("title"),
                "company": job.get("company"),
                "location": job.get("location"),
                "salary_raw": job.get("salary"),
                "salary_min": salary_min,
                "salary_max": salary_max,
                "snippet": job.get("snippet"),
                "source": job.get("source", "indeed"),
                "search_query": search_query,
                "url": job.get("url"),
                "scraped_at": now,
            })
        df = pd.DataFrame(records).drop_duplicates(subset=["url"])
        # Drop URLs already stored; appending them would violate the UNIQUE constraint
        existing = pd.read_sql("SELECT url FROM jobs", self.engine)["url"]
        df = df[~df["url"].isin(existing)]
        df.to_sql("jobs", self.engine, if_exists="append", index=False,
                  method="multi")
        print(f"  Saved {len(df)} new jobs to database.")

    def get_salary_analysis(self) -> pd.DataFrame:
        """Analyse salary distributions by job title."""
        query = """
            SELECT
                search_query,
                COUNT(*) as total_listings,
                COUNT(salary_min) as listings_with_salary,
                ROUND(AVG(salary_min), 0) as avg_min_salary,
                ROUND(AVG(salary_max), 0) as avg_max_salary,
                ROUND(MIN(salary_min), 0) as lowest_salary,
                ROUND(MAX(salary_max), 0) as highest_salary
            FROM jobs
            WHERE salary_min IS NOT NULL
            GROUP BY search_query
            ORDER BY avg_min_salary DESC
        """
        return pd.read_sql(query, self.engine)

    def get_top_hiring_companies(self, search_query: str | None = None) -> pd.DataFrame:
        """Find which companies are hiring the most."""
        # Use a bound parameter rather than string interpolation to avoid SQL injection
        where = "WHERE search_query = :search_query" if search_query else ""
        query = f"""
            SELECT
                company,
                COUNT(*) as open_positions,
                COUNT(DISTINCT location) as cities
            FROM jobs
            {where}
            GROUP BY company
            HAVING COUNT(*) >= 2
            ORDER BY open_positions DESC
            LIMIT 20
        """
        params = {"search_query": search_query} if search_query else None
        return pd.read_sql(sa.text(query), self.engine, params=params)

    def get_location_demand(self) -> pd.DataFrame:
        """Rank cities by number of job postings."""
        query = """
            SELECT
                location,
                COUNT(*) as job_count,
                COUNT(DISTINCT company) as unique_companies
            FROM jobs
            WHERE location IS NOT NULL
              AND location != ''
            GROUP BY location
            ORDER BY job_count DESC
            LIMIT 15
        """
        return pd.read_sql(query, self.engine)

async def run_full_pipeline():
    """
    Complete pipeline: scrape Indeed for multiple roles → store → analyse.
    """
    pipeline = JobMarketPipeline()
    searches = [
        ("Python Developer", "Bangalore"),
        ("Data Scientist", "Hyderabad"),
        ("Machine Learning Engineer", "Remote"),
        ("Backend Engineer", "Mumbai"),
        ("DevOps Engineer", "Pune"),
    ]
    for job_title, location in searches:
        print(f"\n── Scraping: '{job_title}' in '{location}' ──")
        jobs = await scrape_indeed_jobs(job_title, location, num_pages=3)
        pipeline.save_jobs(jobs, search_query=job_title)
        await asyncio.sleep(random.uniform(2, 5))
    # ── Analysis ────────────────────────────────────────────────
    print("\n\n══ JOB MARKET ANALYSIS ══\n")
    print("── Salary Analysis by Role ──")
    salary_df = pipeline.get_salary_analysis()
    print(salary_df.to_string(index=False))
    print("\n── Top Hiring Companies ──")
    companies_df = pipeline.get_top_hiring_companies()
    print(companies_df.head(10).to_string(index=False))
    print("\n── Demand by Location ──")
    location_df = pipeline.get_location_demand()
    print(location_df.to_string(index=False))
    # Export final report
    all_jobs = pd.read_sql("SELECT * FROM jobs", pipeline.engine)
    all_jobs.to_csv("job_market_report.csv", index=False)
    print(f"\nFull dataset: {len(all_jobs)} jobs saved to job_market_report.csv")

asyncio.run(run_full_pipeline())

Part 4: Real Estate Price Analysis
Once you have Redfin or Zillow data, here's how to turn it into useful insights:
import pandas as pd

def analyse_real_estate(csv_file: str) -> tuple[dict, pd.DataFrame]:
    """
    Load and analyse a real estate dataset.
    Works with both Redfin CSV and processed Zillow data.
    Returns the summary dict and the cleaned DataFrame.
    """
    df = pd.read_csv(csv_file)
    # Zillow exports carry a formatted "price" and a numeric "price_raw";
    # keep only the numeric one so the rename below cannot create two "price" columns
    if "price_raw" in df.columns and "price" in df.columns:
        df = df.drop(columns=["price"])
    # Standardise column names
    col_map = {
        "PRICE": "price",
        "BEDS": "beds",
        "BATHS": "baths",
        "SQUARE FEET": "sqft",
        "ADDRESS": "address",
        "CITY": "city",
        "price_raw": "price",
    }
    df = df.rename(columns={k: v for k, v in col_map.items() if k in df.columns})
    # Clean numeric columns
    for col in ["price", "sqft"]:
        if col in df.columns:
            df[col] = pd.to_numeric(
                df[col].astype(str).str.replace(r"[^\d.]", "", regex=True),
                errors="coerce"
            )
    # Price per sqft
    if "price" in df.columns and "sqft" in df.columns:
        df["price_per_sqft"] = (df["price"] / df["sqft"]).round(2)
    summary = {
        "total_listings": len(df),
        "median_price": df["price"].median() if "price" in df.columns else None,
        "avg_price": df["price"].mean() if "price" in df.columns else None,
        "min_price": df["price"].min() if "price" in df.columns else None,
        "max_price": df["price"].max() if "price" in df.columns else None,
        "avg_price_per_sqft": df["price_per_sqft"].mean() if "price_per_sqft" in df.columns else None,
        "avg_sqft": df["sqft"].mean() if "sqft" in df.columns else None,
    }
    print("── Real Estate Summary ──")
    for k, v in summary.items():
        if v is not None:
            print(f"  {k:30s}: {v:,.0f}")
    # Price distribution by beds
    if "beds" in df.columns and "price" in df.columns:
        by_beds = df.groupby("beds")["price"].agg(["count", "median", "mean"])
        by_beds.columns = ["count", "median_price", "avg_price"]
        print("\n── Prices by Bedroom Count ──")
        print(by_beds.to_string())
    return summary, df

summary, df = analyse_real_estate("redfin_austin.csv")
df.to_csv("redfin_austin_clean.csv", index=False)

Scheduling Weekly Data Collection
To track market trends over time, run your scrapers on a weekly schedule:
import asyncio
import schedule
import time
from datetime import datetime

# Assumes run_full_pipeline from Part 3 is defined or imported in this module
async def weekly_job_collection():
    """Run the full job scraping pipeline every Monday."""
    print(f"\n[{datetime.now().isoformat()}] Starting weekly job collection...")
    await run_full_pipeline()
    print("Weekly collection complete.")

def run_async_job():
    asyncio.run(weekly_job_collection())

# Schedule for every Monday at 6 AM
schedule.every().monday.at("06:00").do(run_async_job)
print("Scheduler started. Waiting for next run...")
while True:
    schedule.run_pending()
    time.sleep(60)

Or use a simple cron job on Linux/Mac:
# Edit crontab: crontab -e
# Run every Monday at 6am
0 6 * * 1 cd /path/to/project && python scraper.py >> logs/scraper.log 2>&1

Handling the Most Common Anti-Bot Blocks
Both Zillow and Indeed use anti-bot systems that evolve regularly. Here are the top issues and solutions:
Problem: Zillow shows "Sorry, we couldn't find that page"
Zillow detected the automation. Solutions:
- Use headless=False to confirm the page renders correctly
- Apply playwright-stealth patching
- Increase delays between page loads to 5–10 seconds
- Use a residential proxy with the proxy= parameter in new_context()
Problem: Indeed returns a CAPTCHA page
Your IP has been rate-limited. Solutions:
- Slow down to 4–8 second delays between requests
- Rotate User-Agent strings across requests
- Use curl_cffi with impersonate="chrome120" instead of httpx
Problem: CSS selectors stopped finding elements
Both sites update their HTML structure regularly. Fix:
- Open the site in DevTools, find the current element classes
- Update your selectors accordingly
- Consider targeting data-testid attributes, which are more stable than CSS classes
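
A selector fallback chain makes this kind of breakage less abrupt: try the data-testid selector first, then older class-based selectors. The sketch below is library-agnostic; first_match is a hypothetical helper, and the .companyName class is an invented stand-in for whatever legacy selector a site used previously.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_match(select_one: Callable[[str], Optional[T]],
                selectors: Sequence[str]) -> Optional[T]:
    """Try each selector in order and return the first element found, else None."""
    for selector in selectors:
        element = select_one(selector)
        if element is not None:
            return element
    return None


# Stand-in for card.select_one so the sketch runs without a live page
fake_dom = {".companyName": "Acme Corp"}
company = first_match(
    fake_dom.get,
    ["[data-testid='company-name']", ".companyName"],
)
print(company)
```

With BeautifulSoup, the first argument would be card.select_one, so the parser code only needs its selector lists updated when the markup changes.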
Problem: Empty results on Glassdoor
Glassdoor aggressively detects headless browsers. Solutions:
- Run in headless=False mode during development to debug
- Add more human-like scrolling and delays
- Consider Glassdoor's own API for authorised developer access
FAQ
Q: Is it legal to scrape Zillow and Indeed? Both sites' Terms of Service prohibit automated scraping. For personal research and learning, the risk is minimal — neither typically pursues individuals. For commercial applications (building a competing product, reselling data), use licensed data providers: CoreLogic or ATTOM for real estate; LinkedIn Talent Insights or Burning Glass for jobs.
Q: How often should I run the scraper? For job market trend analysis: weekly is sufficient. For real-time job alerts: daily. For real estate market analysis: weekly or bi-weekly. Running more frequently increases ban risk without adding proportional value.
Q: What's the best free real estate data source? For US data, the Redfin CSV download endpoint (shown above) is the most accessible free source. For international data, government land registry APIs (UK Land Registry, Indian Registration Department portals) provide official transaction data.
Q: Can I scrape salary data from LinkedIn? LinkedIn shows salary ranges on some job postings. The scraping approach is similar to Blog 02 (LinkedIn scraping with Playwright) — use saved cookies, stealth mode, and conservative delays. For systematic salary research, LinkedIn Talent Insights is the official paid option.
Q: How do I avoid my IP being banned permanently? Keep daily request volume low (under 200/day per IP), use residential proxies for volume work, respect robots.txt, and never attempt to log in automatically. If you get a temporary block, wait 24 hours before retrying from a different IP.
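
The request-volume and robots.txt advice can be enforced in code rather than left to discipline. This is a minimal sketch using the standard library's urllib.robotparser; the RequestBudget class is hypothetical, the robots.txt rules are made up for illustration, and the counter is per process, so a real daily cap would need to persist the count between runs.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; fetch the site's real robots.txt in practice
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
Allow: /
""".strip().splitlines()


class RequestBudget:
    """Counts requests and refuses once the limit is reached."""

    def __init__(self, daily_limit: int = 200):
        self.daily_limit = daily_limit
        self.used = 0

    def allow(self) -> bool:
        if self.used >= self.daily_limit:
            return False
        self.used += 1
        return True


robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS)

budget = RequestBudget(daily_limit=2)  # tiny limit so the demo hits the cap
decisions = []
for path in ["/jobs", "/private/data", "/jobs?start=10", "/jobs?start=20"]:
    url = f"https://example.com{path}"
    if not robots.can_fetch("*", url):
        decisions.append("robots")   # disallowed by robots.txt
    elif not budget.allow():
        decisions.append("budget")   # request cap reached
    else:
        decisions.append("fetch")
print(decisions)
```

Checking robots.txt before spending budget means disallowed URLs never count against the cap.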
Summary
| Task | Tool | Key Notes |
|---|---|---|
| Zillow scraping | Playwright + stealth | JS-rendered, needs browser automation |
| Redfin data | httpx + CSV endpoint | Easier, returns structured CSV directly |
| Indeed job scraping | httpx + BeautifulSoup | Server-rendered, moderate anti-bot |
| Glassdoor job scraping | Playwright + stealth | JS-heavy, requires browser |
| Data storage | SQLite + pandas | Lightweight, portable, queryable |
| Salary analysis | pandas groupby | Extract min/max from raw strings |
| Scheduling | schedule / cron | Weekly collection for trend analysis |