LinkedIn Scraping with Python: Profiles, Jobs & Company Pages
Learn how to scrape LinkedIn profiles, jobs, and company pages with Playwright while managing authentication, stealth, rate limits, and account safety.
Senior Developer

LinkedIn is arguably the most valuable professional database on the internet — hundreds of millions of profiles, millions of job postings, and rich company data all in one place. It's no surprise that data scientists, recruiters, and researchers want to access it programmatically.
But LinkedIn is also one of the hardest major sites to scrape. It uses:
Heavy JavaScript rendering — almost no meaningful content is in the initial HTML
Session-based authentication — most data is only visible when logged in
Aggressive bot detection — including TLS fingerprinting, behavioral analysis, and account-level rate limiting
Frequent HTML structure changes — CSS selectors break regularly as LinkedIn updates its frontend
This guide gives you a practical, working approach to scraping LinkedIn in 2025. We'll cover two methods: Playwright (recommended) and an overview of the official API as an alternative.
Legal and ethical notice: LinkedIn's Terms of Service prohibit automated scraping. The hiQ v. LinkedIn lawsuit, which turned on whether scraping publicly available data violates the Computer Fraud and Abuse Act, had mixed outcomes: the Ninth Circuit sided with hiQ on the CFAA question, but the district court later found that hiQ had breached LinkedIn's User Agreement, and the case ended in a settlement in 2022. This guide is for educational purposes. Before scraping LinkedIn for any commercial purpose, consult a lawyer. For most production use cases, LinkedIn's official API or a licensed data provider is a safer choice.
Why Playwright Over Selenium
Both Playwright and Selenium control a real browser, which is necessary for JavaScript-heavy sites. Playwright has several advantages:
| Feature | Playwright | Selenium |
|---|---|---|
| Speed | Faster async API | Slower |
| Stealth | Better fingerprint control | Harder to configure |
| API quality | Modern, intuitive | Older, verbose |
| Browser support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge |
| Auto-waiting | Built-in smart waits | Manual waits required |
| Community (2025) | Rapidly growing | Mature but stagnant |
For LinkedIn specifically, Playwright with playwright-stealth is the current best option.
Installation
pip install playwright playwright-stealth pandas
playwright install chromium

This downloads the Chromium browser binary that Playwright will control.
Step 1: Saving Your LinkedIn Session (One-Time Setup)
The golden rule for LinkedIn scraping is: never automate the login itself. LinkedIn's login page is heavily monitored and a bot logging in triggers immediate account flags.
Instead, log in manually once in a Playwright browser, save your session cookies to a file, and reuse those cookies for all future scraping sessions.
# save_cookies.py: run this once manually
import json
from playwright.sync_api import sync_playwright


def save_linkedin_session():
    with sync_playwright() as p:
        # Launch a visible (non-headless) browser so you can log in yourself
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )
        page = context.new_page()
        page.goto("https://www.linkedin.com/login")

        # Wait for you to log in manually (you have 60 seconds)
        print("Please log in to LinkedIn in the browser window...")
        print("The script will continue once you're on the home feed.")

        # Wait until the feed appears; this confirms a successful login
        page.wait_for_url("**/feed/**", timeout=60000)
        print("Login detected! Saving cookies...")

        cookies = context.cookies()
        with open("linkedin_cookies.json", "w") as f:
            json.dump(cookies, f, indent=2)
        print(f"Saved {len(cookies)} cookies to linkedin_cookies.json")
        browser.close()


save_linkedin_session()

Run this script once, log in manually, and your cookies are saved. These cookies typically stay valid for weeks before LinkedIn requires re-authentication.
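Because the saved cookies eventually expire, it can help to check how long the session cookie has left before starting a long run. The sketch below is a minimal example, assuming the file holds the list of cookie dicts returned by Playwright's context.cookies() (where expires is a Unix timestamp in seconds and -1 marks a session cookie) and assuming li_at is the cookie that carries the LinkedIn login session. The function name is hypothetical.

```python
import time


def auth_cookie_seconds_left(cookies, name="li_at", now=None):
    """Return the seconds remaining before the named cookie expires.

    Returns None if the cookie is missing and float("inf") for a session
    cookie (Playwright reports those with expires == -1).
    The default name "li_at" is an assumption about LinkedIn's auth cookie.
    """
    now = time.time() if now is None else now
    for cookie in cookies:
        if cookie.get("name") == name:
            expires = cookie.get("expires", -1)
            if expires is None or expires < 0:
                return float("inf")
            return expires - now
    return None
```

A typical use would be to load linkedin_cookies.json with json.load, call the function, and re-run save_cookies.py when the result is None or only a few hours remain.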
Step 2: Loading Your Session in Future Scrapes
import json
from playwright.sync_api import sync_playwright


def create_linkedin_context(playwright):
    """Create a browser context pre-loaded with your saved LinkedIn cookies."""
    browser = playwright.chromium.launch(
        headless=True,  # Can run headless once cookies are saved
        args=[
            "--no-sandbox",
            "--disable-blink-features=AutomationControlled",
        ]
    )
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 900},
        locale="en-US",
        timezone_id="America/New_York"
    )

    # Load saved cookies
    with open("linkedin_cookies.json") as f:
        cookies = json.load(f)
    context.add_cookies(cookies)
    return browser, context

Step 3: Applying Stealth Mode
Without stealth patches, Playwright exposes dozens of signals that LinkedIn can use to identify it as a bot. The playwright-stealth package patches the most important ones:
# stealth_sync is the playwright-stealth 1.x helper; newer releases may expose
# a different API, so check the version you installed.
from playwright_stealth import stealth_sync


def create_stealthy_page(context):
    """Create a page with stealth patches applied."""
    page = context.new_page()
    # Apply stealth: patches navigator.webdriver, plugins, languages, etc.
    stealth_sync(page)
    return page

The key things stealth patches:
navigator.webdriver → set to undefined (normally true in automation)
navigator.plugins → adds realistic fake plugins
navigator.languages → set to ["en-US", "en"]
window.chrome → adds the Chrome runtime object
WebGL vendor and renderer → reports a common GPU vendor instead of the headless default
Scraping LinkedIn Profiles
A LinkedIn profile URL looks like: https://www.linkedin.com/in/username
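Profile URLs collected from search results or spreadsheets often carry tracking parameters, country subdomains, or sub-page paths. The following is a minimal sketch of normalizing them to the form shown above before scraping. The function name is hypothetical, and it assumes the input includes a scheme such as https://.

```python
from urllib.parse import urlsplit


def normalize_profile_url(url):
    """Return https://www.linkedin.com/in/<slug>/ for a profile URL, else None.

    Drops query strings, fragments, country subdomains, and anything after
    the profile slug. Inputs without a scheme are rejected (returns None).
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host != "linkedin.com" and not host.endswith(".linkedin.com"):
        return None
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "in":
        return None
    return f"https://www.linkedin.com/in/{segments[1]}/"
```

Normalizing first also makes it easy to de-duplicate a URL list with set() so the same profile is not visited twice.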
import time
import random
import json
from playwright.sync_api import sync_playwright

# create_linkedin_context() and create_stealthy_page() come from Steps 2 and 3.


def human_delay(min_sec=1.5, max_sec=4.0):
    """Random delay to mimic human browsing behavior."""
    time.sleep(random.uniform(min_sec, max_sec))


def scroll_to_load(page, scrolls=5):
    """Scroll down the page to trigger lazy-loaded sections."""
    for _ in range(scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight * 0.8)")
        time.sleep(random.uniform(0.8, 1.5))


def scrape_profile(page, profile_url):
    """Extract structured data from a LinkedIn profile page."""
    page.goto(profile_url, wait_until="domcontentloaded")
    human_delay(2, 3.5)

    # Scroll to load experience, education, and skills sections
    scroll_to_load(page, scrolls=6)
    human_delay(1, 2)

    data = {}

    # --- Basic Info ---
    # A missing element makes query_selector return None, so the chained
    # inner_text() call raises; each field then falls back to None.
    try:
        data["name"] = page.query_selector("h1").inner_text().strip()
    except Exception:
        data["name"] = None
    try:
        data["headline"] = page.query_selector(
            ".text-body-medium.break-words"
        ).inner_text().strip()
    except Exception:
        data["headline"] = None
    try:
        data["location"] = page.query_selector(
            ".text-body-small.inline.t-black--light.break-words"
        ).inner_text().strip()
    except Exception:
        data["location"] = None

    # --- About Section ---
    try:
        about_el = page.query_selector("#about ~ div .full-width")
        data["about"] = about_el.inner_text().strip() if about_el else None
    except Exception:
        data["about"] = None

    # --- Experience ---
    data["experience"] = []
    try:
        exp_items = page.query_selector_all(
            "#experience ~ div li.artdeco-list__item"
        )
        for item in exp_items:
            title_el = item.query_selector(".t-bold span")
            company_el = item.query_selector(".t-14.t-normal span")
            duration_el = item.query_selector(".t-14.t-normal.t-black--light span")
            data["experience"].append({
                "title": title_el.inner_text().strip() if title_el else None,
                "company": company_el.inner_text().strip() if company_el else None,
                "duration": duration_el.inner_text().strip() if duration_el else None,
            })
    except Exception:
        pass

    # --- Education ---
    data["education"] = []
    try:
        edu_items = page.query_selector_all(
            "#education ~ div li.artdeco-list__item"
        )
        for item in edu_items:
            school_el = item.query_selector(".t-bold span")
            degree_el = item.query_selector(".t-14.t-normal span")
            data["education"].append({
                "school": school_el.inner_text().strip() if school_el else None,
                "degree": degree_el.inner_text().strip() if degree_el else None,
            })
    except Exception:
        pass

    # --- Skills ---
    data["skills"] = []
    try:
        skill_els = page.query_selector_all(
            "#skills ~ div .t-bold span[aria-hidden='true']"
        )
        data["skills"] = [el.inner_text().strip() for el in skill_els[:20]]
    except Exception:
        pass

    data["profile_url"] = profile_url
    return data


# Main execution
def main():
    profile_urls = [
        "https://www.linkedin.com/in/some-public-profile/",
        # Add more profile URLs here
    ]
    results = []
    with sync_playwright() as p:
        browser, context = create_linkedin_context(p)
        page = create_stealthy_page(context)
        for i, url in enumerate(profile_urls):
            print(f"[{i+1}/{len(profile_urls)}] Scraping: {url}")
            profile_data = scrape_profile(page, url)
            results.append(profile_data)
            human_delay(3, 6)  # longer delay between profiles
        browser.close()

    # Save results
    with open("profiles.json", "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(results)} profiles to profiles.json")


if __name__ == "__main__":
    main()

Scraping LinkedIn Job Listings
Job listings are slightly easier to scrape than profiles because they are available without being logged in (for most searches). Here is a dedicated job scraper:
import time
import random
from urllib.parse import quote, urlencode

import pandas as pd
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

JOB_COLUMNS = ["title", "company", "location", "posted_date", "url"]


def scrape_linkedin_jobs(keywords, location, num_pages=5):
    """
    Scrape LinkedIn job listings for given keywords and location.

    Args:
        keywords: Job search term, e.g. "python developer"
        location: Location string, e.g. "India"
        num_pages: How many pages to scrape (25 jobs per page)
    """
    def build_url(start):
        # urlencode escapes every reserved character, not just spaces
        query = urlencode(
            {"keywords": keywords, "location": location, "start": start},
            quote_via=quote,
        )
        return f"https://www.linkedin.com/jobs/search/?{query}"

    all_jobs = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"
        )
        page = context.new_page()
        stealth_sync(page)

        for page_num in range(num_pages):
            url = build_url(page_num * 25)
            print(f"Scraping page {page_num + 1}: {url}")
            page.goto(url, wait_until="domcontentloaded")
            time.sleep(random.uniform(2, 4))

            # Scroll to load all job cards
            for _ in range(3):
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                time.sleep(1)

            job_cards = page.query_selector_all(".jobs-search__results-list li")
            for card in job_cards:
                try:
                    title = card.query_selector(
                        ".base-search-card__title"
                    ).inner_text().strip()
                    company = card.query_selector(
                        ".base-search-card__subtitle"
                    ).inner_text().strip()
                    location_el = card.query_selector(
                        ".job-search-card__location"
                    )
                    job_location = location_el.inner_text().strip() if location_el else ""
                    link = card.query_selector("a.base-card__full-link")
                    job_url = link.get_attribute("href") if link else ""
                    date_el = card.query_selector("time")
                    posted_date = date_el.get_attribute("datetime") if date_el else ""
                    all_jobs.append({
                        "title": title,
                        "company": company,
                        "location": job_location,
                        "posted_date": posted_date,
                        "url": job_url
                    })
                except Exception:
                    continue  # Skip malformed cards
            time.sleep(random.uniform(2, 5))
        browser.close()

    # Explicit columns keep drop_duplicates from failing when nothing was scraped
    df = pd.DataFrame(all_jobs, columns=JOB_COLUMNS).drop_duplicates(subset=["url"])
    return df


# Run
df = scrape_linkedin_jobs("python developer", "India", num_pages=4)
df.to_csv("linkedin_jobs.csv", index=False)
print(f"Scraped {len(df)} unique job listings")
print(df[["title", "company", "location"]].head(10))

Scraping Company Pages
Company pages contain employee count, industry, headquarters location, and a "People" section showing employees.
import time
import random


def scrape_company_page(page, company_url):
    """Extract data from a LinkedIn company page."""
    page.goto(company_url, wait_until="domcontentloaded")
    time.sleep(random.uniform(2, 4))

    # Scroll to load all sections
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)

    data = {"url": company_url}

    # Company name
    try:
        data["name"] = page.query_selector("h1").inner_text().strip()
    except Exception:
        data["name"] = None

    # Tagline / description
    try:
        data["tagline"] = page.query_selector(
            ".org-top-card-summary__tagline"
        ).inner_text().strip()
    except Exception:
        data["tagline"] = None

    # Overview stats (employees, industry, HQ, type)
    data["overview"] = {}
    try:
        overview_items = page.query_selector_all(
            ".org-about-module__margin-bottom"
        )
        for item in overview_items:
            label_el = item.query_selector("dt")
            value_el = item.query_selector("dd")
            if label_el and value_el:
                label = label_el.inner_text().strip()
                value = value_el.inner_text().strip()
                data["overview"][label] = value
    except Exception:
        pass

    # Number of followers
    try:
        followers_el = page.query_selector(
            ".org-top-card-summary-info-list__info-item"
        )
        data["followers"] = followers_el.inner_text().strip() if followers_el else None
    except Exception:
        data["followers"] = None

    return data

Rate Limiting and Account Safety
LinkedIn can permanently ban accounts that scrape aggressively. Follow these guidelines to minimize risk:
Delays between requests:
# Pick one of the following, depending on how much risk you accept.
# Conservative: for accounts you care about
time.sleep(random.uniform(5, 12))
# Moderate
time.sleep(random.uniform(2, 5))
# Risky: only for throwaway accounts
time.sleep(random.uniform(0.5, 1.5))

Daily limits to stay safe:
Profile views: under 80–100 per day (LinkedIn counts these)
Job page loads: under 200 per day
Company pages: under 50 per day
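The daily limits above can be enforced in code rather than tracked by hand. Here is a minimal sketch of an in-memory counter that refuses further page loads once a category's cap is reached and resets when the date changes. The class name, method name, and category keys are hypothetical, and a real scraper would need to persist the counts to disk so they survive restarts.

```python
import datetime


class DailyBudget:
    """Counts page loads per category and refuses work past a daily cap."""

    def __init__(self, limits):
        self.limits = dict(limits)   # e.g. {"profile": 80, "job": 200}
        self.day = None              # date the current counts belong to
        self.counts = {}

    def try_spend(self, category, today=None):
        """Record one page load; return False if the cap is already reached."""
        today = today or datetime.date.today()
        if today != self.day:        # new day: reset all counters
            self.day = today
            self.counts = {}
        used = self.counts.get(category, 0)
        if used >= self.limits[category]:
            return False
        self.counts[category] = used + 1
        return True


# Caps taken from the list above (lower bound used for profiles)
budget = DailyBudget({"profile": 80, "job": 200, "company": 50})
```

In a scraping loop, calling budget.try_spend("profile") before each page.goto and stopping when it returns False keeps a run inside the limits.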
Use a dedicated scraping account: Never scrape with your main personal LinkedIn account. Create a separate account specifically for scraping. If it gets banned, your real professional presence is unaffected.
Rotate your session:
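import random


def get_random_user_agent():
    # Full Chrome user-agent strings; truncated ones are easy to spot.
    # Pick one per browser context and keep it for that session.
    agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
    ]
    return random.choice(agents)

The Better Alternative: LinkedIn's Official API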
For production use, LinkedIn's official API is more reliable and legally sound. LinkedIn offers:
LinkedIn Marketing API: for ad data and company insights
Sign In with LinkedIn: OAuth-based profile data access
LinkedIn Learning API: for course and learning data
LinkedIn Talent Insights: paid enterprise product for HR analytics
Apply at: developer.linkedin.com
For bulk data, commercial data providers like People Data Labs, Clearbit, and Apollo.io offer aggregated professional profile data under their own licensing terms and expose it via clean APIs.
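To give a sense of what the OAuth-based route involves, the sketch below builds the URL a user is sent to in the first step of an authorization-code flow. It assumes LinkedIn's OAuth 2.0 authorization endpoint and the OpenID Connect scopes (openid, profile, email). The function name, client ID, and redirect URI are placeholders, and the exact scopes available depend on the products enabled for your app in the developer portal.

```python
from urllib.parse import urlencode

# Assumed LinkedIn OAuth 2.0 authorization endpoint
AUTHORIZE_URL = "https://www.linkedin.com/oauth/v2/authorization"


def build_authorization_url(client_id, redirect_uri, state,
                            scopes=("openid", "profile", "email")):
    """Build the URL that starts the authorization-code flow.

    `state` should be a random, per-request value that you verify when
    LinkedIn redirects back, to protect against CSRF.
    """
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "scope": " ".join(scopes),
    }
    return f"{AUTHORIZE_URL}?{urlencode(params)}"
```

After the user approves access, LinkedIn redirects to redirect_uri with a code parameter, which the server then exchanges for an access token. That exchange step is not shown here.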
Debugging Common LinkedIn Scraping Issues
Issue: Page loads but the data isn't there
LinkedIn lazy-loads content. Always scroll the page before extracting:
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(2)

Issue: Redirected to login page
Your cookies have expired. Re-run save_cookies.py and log in again.
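Expired cookies are easier to handle when the scraper notices the redirect itself instead of silently extracting empty fields. A minimal sketch follows, assuming that an unauthenticated session ends up on a path such as /login, /uas/login, /authwall, or /checkpoint. These prefixes are assumptions and should be checked against what you actually observe in the browser.

```python
from urllib.parse import urlsplit

# Assumed path prefixes for LinkedIn's login and verification pages
LOGIN_PATH_PREFIXES = ("/login", "/uas/login", "/authwall", "/checkpoint")


def looks_like_login_redirect(current_url):
    """Return True if the URL path is one of the assumed login/auth pages."""
    path = urlsplit(current_url).path
    return any(
        path == prefix or path.startswith(prefix + "/")
        for prefix in LOGIN_PATH_PREFIXES
    )
```

Calling looks_like_login_redirect(page.url) right after page.goto(...) lets the run stop early with a clear message to refresh the cookies.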
Issue: "Hmm, something went wrong" screen
LinkedIn detected automation. Reduce speed, use stealth mode, and add more human-like interactions.
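One way to reduce speed after hitting this screen is to back off exponentially instead of retrying at the normal pace. The sketch below computes a wait time that doubles with each consecutive failure, with random jitter so the pauses are not perfectly regular. The function name and the base and cap values are illustrative assumptions, not LinkedIn-specific thresholds.

```python
import random


def backoff_delay(attempt, base=30.0, cap=900.0, rng=random):
    """Return a wait in seconds for the given consecutive-failure count.

    The ceiling doubles per attempt (base * 2**attempt) up to `cap`;
    the result is drawn uniformly from the upper half of that range.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling / 2 + rng.uniform(0, ceiling / 2)
```

With the defaults, the first failure waits between 15 and 30 seconds, and waits never exceed 15 minutes. Passing the result to time.sleep before the next page load applies the pause.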
Issue: Getting empty strings from selectors
LinkedIn frequently changes its CSS classes. Inspect the current page in DevTools and update your selectors. Using XPath is sometimes more stable than class-based CSS selectors.
# XPath is often more robust for LinkedIn
name = page.query_selector("xpath=//h1[contains(@class, 'top-card')]")
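Since any single selector can break without notice, another option is to keep a short list of candidates per field and take the first one that yields text. This is a minimal sketch. The helper name is hypothetical, and it accepts any callable that maps a selector string to an element with an inner_text() method (or None), which is the shape of page.query_selector.

```python
def first_text(query, selectors):
    """Try each selector in order; return the first non-empty text, else None.

    `query` is a callable such as page.query_selector that returns an
    element-like object with inner_text(), or None when nothing matches.
    """
    for selector in selectors:
        element = query(selector)
        if element is None:
            continue
        text = element.inner_text().strip()
        if text:
            return text
    return None
```

For example, first_text(page.query_selector, ["xpath=//h1[contains(@class, 'top-card')]", "h1"]) tries the XPath above first and falls back to a plain h1.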
Full Pipeline: Scrape, Save, and Analyse
import json
import pandas as pd
from playwright.sync_api import sync_playwright

# Reuses create_linkedin_context(), create_stealthy_page(), scrape_profile(),
# and human_delay() from the earlier sections.


def run_pipeline(profile_urls):
    results = []
    with sync_playwright() as p:
        browser, context = create_linkedin_context(p)
        page = create_stealthy_page(context)
        for i, url in enumerate(profile_urls):
            print(f"[{i+1}/{len(profile_urls)}] {url}")
            try:
                data = scrape_profile(page, url)
                results.append(data)
            except Exception as e:
                print(f"  Error: {e}")
                results.append({"profile_url": url, "error": str(e)})
            human_delay(4, 8)
        browser.close()

    # Save raw JSON
    with open("profiles_raw.json", "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Flatten to DataFrame (ignoring nested lists for CSV)
    flat = []
    for r in results:
        # Fall back to placeholders so empty or missing lists don't raise IndexError
        experience = r.get("experience") or [{}]
        skills = r.get("skills") or [None]
        flat.append({
            "name": r.get("name"),
            "headline": r.get("headline"),
            "location": r.get("location"),
            "about": r.get("about"),
            "experience_count": len(r.get("experience", [])),
            "first_role": experience[0].get("title"),
            "first_company": experience[0].get("company"),
            "skills_count": len(r.get("skills", [])),
            "top_skill": skills[0],
            "url": r.get("profile_url")
        })
    df = pd.DataFrame(flat)
    df.to_csv("profiles_summary.csv", index=False)
    print(f"\nSaved {len(df)} profiles.")
    print(df[["name", "headline", "first_company"]].head())
    return df


# Run it
urls = [
    "https://www.linkedin.com/in/example-profile-1/",
    "https://www.linkedin.com/in/example-profile-2/",
]
df = run_pipeline(urls)

Summary
| Topic | Key takeaway |
|---|---|
| Authentication | Never automate login; save and reuse cookies |
| Stealth | Use playwright-stealth |
| Delays | Random delays of 3–8 seconds between profiles |
| Profiles | Scroll to load, extract name/headline/experience/education |
| Jobs | Available without login, 25 per page |
| Company pages | Rich overview stats available (employees, industry, HQ, type) |
| Account safety | Use a dedicated scraping account, stay under 100 profiles/day |
| Best alternative | LinkedIn API or licensed data providers for production |