ZyVOP Logo
Content That Connects
LinkedIn Scraping with Python: Profiles, Jobs & Company Pages

Learn how to scrape LinkedIn profiles, jobs, and company pages with Playwright while managing authentication, stealth, rate limits, and account safety.

#LinkedIn#Playwright#Web Scraping#Python#Browser Automation#Data Extraction#Job Scraping#Web Crawling#automation#Data Collection
ZyVOP

Senior Developer

June 3, 2026

LinkedIn is arguably the most valuable professional database on the internet — hundreds of millions of profiles, millions of job postings, and rich company data all in one place. It's no surprise that data scientists, recruiters, and researchers want to access it programmatically.

But LinkedIn is also the hardest major site to scrape. It uses:

  • Heavy JavaScript rendering — almost no meaningful content is in the initial HTML

  • Session-based authentication — most data is only visible when logged in

  • Aggressive bot detection — including TLS fingerprinting, behavioral analysis, and account-level rate limiting

  • Frequent HTML structure changes — CSS selectors break regularly as LinkedIn updates its frontend

This guide gives you a practical, working approach to scraping LinkedIn in 2025. We'll cover two methods: Playwright (recommended) and an overview of the official API as an alternative.

Legal and ethical notice: LinkedIn's Terms of Service prohibit automated scraping. In hiQ v. LinkedIn, hiQ argued that scraping publicly available data does not violate the Computer Fraud and Abuse Act. The Ninth Circuit sided with hiQ on that question, but the case ended in 2022 with a ruling that hiQ had breached LinkedIn's User Agreement, so the legal picture is mixed. This guide is for educational purposes. Before scraping LinkedIn for any commercial purpose, consult a lawyer. For most production use cases, LinkedIn's official API or a licensed data provider is a safer choice.


Why Playwright Over Selenium

Both Playwright and Selenium control a real browser, which is necessary for JavaScript-heavy sites. Playwright has several advantages:

| Feature | Playwright | Selenium |
|---|---|---|
| Speed | Faster async API | Slower |
| Stealth | Better fingerprint control | Harder to configure |
| API quality | Modern, intuitive | Older, verbose |
| Browser support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge, Safari |
| Auto-waiting | Built-in smart waits | Manual waits required |
| Community (2025) | Rapidly growing | Mature but stagnant |

For LinkedIn specifically, Playwright with playwright-stealth is the current best option.


Installation

pip install playwright playwright-stealth pandas
playwright install chromium

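This downloads the Chromium browser binary that Playwright will control. The examples below import stealth_sync, which is the playwright-stealth 1.x interface. The 2.x releases appear to use a different, class-based API, so pin the package with "playwright-stealth<2" if that import fails.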


Step 1: Saving Your LinkedIn Session (One-Time Setup)

The golden rule for LinkedIn scraping is: never automate the login itself. LinkedIn's login page is heavily monitored and a bot logging in triggers immediate account flags.

Instead, log in manually once in a Playwright browser, save your session cookies to a file, and reuse those cookies for all future scraping sessions.

# save_cookies.py — run this once manually
import json
from playwright.sync_api import sync_playwright

def save_linkedin_session():
    with sync_playwright() as p:
        # Launch a visible (non-headless) browser so you can log in yourself
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )

        page = context.new_page()
        page.goto("https://www.linkedin.com/login")

        # Wait for you to log in manually — you have 60 seconds
        print("Please log in to LinkedIn in the browser window...")
        print("The script will continue once you're on the home feed.")

        # Wait until the feed appears — this confirms a successful login
        page.wait_for_url("**/feed/**", timeout=60000)
        print("Login detected! Saving cookies...")

        cookies = context.cookies()
        with open("linkedin_cookies.json", "w") as f:
            json.dump(cookies, f, indent=2)

        print(f"Saved {len(cookies)} cookies to linkedin_cookies.json")
        browser.close()

if __name__ == "__main__":
    save_linkedin_session()

Run this script once, log in manually, and your cookies are saved. These cookies typically stay valid for weeks before LinkedIn requires re-authentication.
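
Because the saved cookies eventually expire, it can help to check the file before starting a long run. The sketch below is one way to do that. It assumes the login session lives in a cookie named li_at (an assumption, not something this guide confirms) and that the file has the shape Playwright's context.cookies() produces, where expires is a Unix timestamp in seconds, or -1 for a session cookie. The function names and the three-day warning threshold are illustrative.

```python
import json
import time

SESSION_COOKIE = "li_at"  # assumption: name of LinkedIn's login cookie


def session_days_left(cookies, now=None):
    """Return days until the session cookie expires.

    Returns None if the cookie is missing, and float("inf") if it has no
    expiry (Playwright reports session cookies with expires == -1).
    """
    now = time.time() if now is None else now
    for cookie in cookies:
        if cookie.get("name") == SESSION_COOKIE:
            expires = cookie.get("expires", -1)
            if expires is None or expires < 0:
                return float("inf")
            return (expires - now) / 86400
    return None


def check_cookie_file(path="linkedin_cookies.json", warn_days=3):
    """Load the saved cookies and report whether a fresh login is needed."""
    with open(path) as f:
        days = session_days_left(json.load(f))
    if days is None or days <= 0:
        return "expired: re-run save_cookies.py"
    if days < warn_days:
        return f"expiring soon: {days:.1f} days left"
    return "ok"
```

Calling check_cookie_file() at the top of a scraping script gives an early, readable failure instead of a silent redirect to the login page halfway through a run.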


Step 2: Loading Your Session in Future Scrapes

import json
from playwright.sync_api import sync_playwright

def create_linkedin_context(playwright):
    """Create a browser context pre-loaded with your saved LinkedIn cookies."""
    browser = playwright.chromium.launch(
        headless=True,  # Can run headless once cookies are saved
        args=[
            "--no-sandbox",
            "--disable-blink-features=AutomationControlled",
        ]
    )

    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 900},
        locale="en-US",
        timezone_id="America/New_York"
    )

    # Load saved cookies
    with open("linkedin_cookies.json") as f:
        cookies = json.load(f)
    context.add_cookies(cookies)

    return browser, context

Step 3: Applying Stealth Mode

Without stealth patches, Playwright exposes dozens of signals that LinkedIn can use to identify it as a bot. The playwright-stealth package patches the most important ones:

from playwright_stealth import stealth_sync

def create_stealthy_page(context):
    """Create a page with stealth patches applied."""
    page = context.new_page()

    # Apply stealth — patches navigator.webdriver, plugins, languages, etc.
    stealth_sync(page)

    return page

The key things stealth patches:

  • navigator.webdriver → set to undefined (normally true in automation)

  • navigator.plugins → adds realistic fake plugins

  • navigator.languages → set to ["en-US", "en"]

  • window.chrome → adds the Chrome runtime object

  • WebGL vendor and renderer → reports a common consumer GPU instead of the headless software renderer
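
To confirm the patches took effect on your setup, you can read some of these properties back from the page. This is a minimal sketch: probe_fingerprint and suspicious_signals are hypothetical helpers, not part of playwright-stealth, and they rely only on page.evaluate from Playwright's sync API. Which values count as suspicious is an assumption based on the list above.

```python
# JavaScript run inside the page; returns a plain object Playwright can serialize
PROBE_JS = """() => ({
    webdriver: navigator.webdriver === true,
    languages: Array.from(navigator.languages || []),
    pluginCount: navigator.plugins.length,
    hasChrome: typeof window.chrome !== "undefined",
})"""


def probe_fingerprint(page):
    """Read a few automation-related properties from the live page."""
    return page.evaluate(PROBE_JS)


def suspicious_signals(fingerprint):
    """Return the names of properties that still look like automation."""
    problems = []
    if fingerprint.get("webdriver"):
        problems.append("navigator.webdriver")
    if not fingerprint.get("languages"):
        problems.append("navigator.languages")
    if fingerprint.get("pluginCount", 0) == 0:
        problems.append("navigator.plugins")
    if not fingerprint.get("hasChrome"):
        problems.append("window.chrome")
    return problems
```

After applying stealth and loading any page, print(suspicious_signals(probe_fingerprint(page))) should ideally print an empty list. A non-empty list points at the property to investigate, though passing this check does not mean the browser is undetectable.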


Scraping LinkedIn Profiles

A LinkedIn profile URL looks like: https://www.linkedin.com/in/username

import time
import random
import json
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def human_delay(min_sec=1.5, max_sec=4.0):
    """Random delay to mimic human browsing behavior."""
    time.sleep(random.uniform(min_sec, max_sec))

def scroll_to_load(page, scrolls=5):
    """Scroll down the page to trigger lazy-loaded sections."""
    for _ in range(scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight * 0.8)")
        time.sleep(random.uniform(0.8, 1.5))

def text_or_none(root, selector):
    """Return the stripped inner text of the first match, or None if absent."""
    el = root.query_selector(selector)
    return el.inner_text().strip() if el else None

def scrape_profile(page, profile_url):
    """Extract structured data from a LinkedIn profile page."""
    page.goto(profile_url, wait_until="domcontentloaded")
    human_delay(2, 3.5)

    # Scroll to load experience, education, and skills sections
    scroll_to_load(page, scrolls=6)
    human_delay(1, 2)

    data = {}

    # --- Basic Info ---
    data["name"] = text_or_none(page, "h1")
    data["headline"] = text_or_none(page, ".text-body-medium.break-words")
    data["location"] = text_or_none(
        page, ".text-body-small.inline.t-black--light.break-words"
    )

    # --- About Section ---
    data["about"] = text_or_none(page, "#about ~ div .full-width")

    # --- Experience ---
    data["experience"] = []
    try:
        exp_items = page.query_selector_all(
            "#experience ~ div li.artdeco-list__item"
        )
        for item in exp_items:
            title_el = item.query_selector(".t-bold span")
            company_el = item.query_selector(".t-14.t-normal span")
            duration_el = item.query_selector(".t-14.t-normal.t-black--light span")

            data["experience"].append({
                "title": title_el.inner_text().strip() if title_el else None,
                "company": company_el.inner_text().strip() if company_el else None,
                "duration": duration_el.inner_text().strip() if duration_el else None,
            })
    except Exception:
        pass  # Section missing or markup changed; keep whatever was collected

    # --- Education ---
    data["education"] = []
    try:
        edu_items = page.query_selector_all(
            "#education ~ div li.artdeco-list__item"
        )
        for item in edu_items:
            school_el = item.query_selector(".t-bold span")
            degree_el = item.query_selector(".t-14.t-normal span")

            data["education"].append({
                "school": school_el.inner_text().strip() if school_el else None,
                "degree": degree_el.inner_text().strip() if degree_el else None,
            })
    except Exception:
        pass  # Section missing or markup changed; keep whatever was collected

    # --- Skills ---
    data["skills"] = []
    try:
        skill_els = page.query_selector_all(
            "#skills ~ div .t-bold span[aria-hidden='true']"
        )
        data["skills"] = [el.inner_text().strip() for el in skill_els[:20]]
    except Exception:
        pass  # Section missing or markup changed; leave the list empty

    data["profile_url"] = profile_url
    return data


# Main execution
def main():
    profile_urls = [
        "https://www.linkedin.com/in/some-public-profile/",
        # Add more profile URLs here
    ]

    results = []

    with sync_playwright() as p:
        browser, context = create_linkedin_context(p)
        page = create_stealthy_page(context)

        for i, url in enumerate(profile_urls):
            print(f"[{i+1}/{len(profile_urls)}] Scraping: {url}")
            profile_data = scrape_profile(page, url)
            results.append(profile_data)
            human_delay(3, 6)  # longer delay between profiles

        browser.close()

    # Save results
    with open("profiles.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print(f"Saved {len(results)} profiles to profiles.json")

if __name__ == "__main__":
    main()

Scraping LinkedIn Job Listings

Job listings are slightly easier to scrape than profiles because they are available without being logged in (for most searches). Here is a dedicated job scraper:

import time
import random
from urllib.parse import urlencode

import pandas as pd
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def scrape_linkedin_jobs(keywords, location, num_pages=5):
    """
    Scrape LinkedIn job listings for given keywords and location.

    Args:
        keywords: Job search term, e.g. "python developer"
        location: Location string, e.g. "India"
        num_pages: How many pages to scrape (25 jobs per page)
    """
    # urlencode escapes spaces and special characters (&, +, #) in the search terms
    query = urlencode({"keywords": keywords, "location": location})
    base_url = f"https://www.linkedin.com/jobs/search/?{query}&start={{}}"

    all_jobs = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
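        context = browser.new_context(
            # Use a complete user-agent string; a truncated one stands out
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"
        )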
        page = context.new_page()
        stealth_sync(page)

        for page_num in range(num_pages):
            start = page_num * 25
            url = base_url.format(start)
            print(f"Scraping page {page_num + 1}: {url}")

            page.goto(url, wait_until="domcontentloaded")
            time.sleep(random.uniform(2, 4))

            # Scroll to load all job cards
            for _ in range(3):
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                time.sleep(1)

            job_cards = page.query_selector_all(".jobs-search__results-list li")

            for card in job_cards:
                try:
                    title = card.query_selector(
                        ".base-search-card__title"
                    ).inner_text().strip()
                    company = card.query_selector(
                        ".base-search-card__subtitle"
                    ).inner_text().strip()
                    location_el = card.query_selector(
                        ".job-search-card__location"
                    )
                    job_location = location_el.inner_text().strip() if location_el else ""
                    link = card.query_selector("a.base-card__full-link")
                    job_url = link.get_attribute("href") if link else ""
                    date_el = card.query_selector("time")
                    posted_date = date_el.get_attribute("datetime") if date_el else ""

                    all_jobs.append({
                        "title": title,
                        "company": company,
                        "location": job_location,
                        "posted_date": posted_date,
                        "url": job_url
                    })
                except Exception:
                    continue  # Skip cards missing a title or company

            time.sleep(random.uniform(2, 5))

        browser.close()

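    # Fix the columns up front so an empty result still yields a valid DataFrame
    # (without this, drop_duplicates raises KeyError when no jobs were found)
    columns = ["title", "company", "location", "posted_date", "url"]
    df = pd.DataFrame(all_jobs, columns=columns).drop_duplicates(subset=["url"])
    return df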


# Run
df = scrape_linkedin_jobs("python developer", "India", num_pages=4)
df.to_csv("linkedin_jobs.csv", index=False)
print(f"Scraped {len(df)} unique job listings")
print(df[["title", "company", "location"]].head(10))
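
One caveat with deduplicating on url: links on job cards can carry tracking query parameters, so the same posting may appear under different URLs. The sketch below strips the query string and fragment before comparison. It assumes a posting is identified by the URL path alone, which is an assumption about LinkedIn's URL scheme, and normalize_job_url is an illustrative name.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_job_url(url):
    """Drop the query string and fragment so tracking parameters are ignored."""
    if not url:
        return url
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))
```

Applying it with df["url"] = df["url"].map(normalize_job_url) before calling drop_duplicates should make the deduplication stricter.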

Scraping Company Pages

Company pages contain employee count, industry, headquarters location, and a "People" section showing employees.

import random
import time

def scrape_company_page(page, company_url):
    """Extract data from a LinkedIn company page."""
    page.goto(company_url, wait_until="domcontentloaded")
    time.sleep(random.uniform(2, 4))

    # Scroll to load all sections
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)

    def text_of(selector):
        """Return stripped text for the first match, or None if it is missing."""
        el = page.query_selector(selector)
        return el.inner_text().strip() if el else None

    data = {"url": company_url}

    # Company name and tagline / description
    data["name"] = text_of("h1")
    data["tagline"] = text_of(".org-top-card-summary__tagline")

    # Overview stats (employees, industry, HQ, type) as label/value pairs
    data["overview"] = {}
    for item in page.query_selector_all(".org-about-module__margin-bottom"):
        label_el = item.query_selector("dt")
        value_el = item.query_selector("dd")
        if label_el and value_el:
            label = label_el.inner_text().strip()
            data["overview"][label] = value_el.inner_text().strip()

    # Number of followers
    data["followers"] = text_of(".org-top-card-summary-info-list__info-item")

    return data
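
The follower and overview values arrive as display strings, so they usually need converting before analysis. Below is a minimal sketch of that conversion. The input formats it handles ("12,345 followers", "1.2K followers", "2M") are assumptions about how counts are rendered, and parse_count is a hypothetical helper.

```python
import re

_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Extract the first number from a display string, or None if absent."""
    if not text:
        return None
    # Digits with optional thousands separators and decimals, then an optional K/M suffix
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([kKmM]?)\b", text)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    return int(round(number * _MULTIPLIERS[match.group(2).lower()]))
```

For a range such as "10,001+ employees" this returns only the lower bound, which is usually enough for sorting or bucketing companies by size.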

Rate Limiting and Account Safety

LinkedIn can permanently ban accounts that scrape aggressively. Follow these guidelines to minimize risk:

Delays between requests:

# Conservative — for accounts you care about
time.sleep(random.uniform(5, 12))

# Moderate
time.sleep(random.uniform(2, 5))

# Risky — only for throwaway accounts
time.sleep(random.uniform(0.5, 1.5))

Daily limits to stay safe:

  • Profile views: under 80–100 per day (LinkedIn counts these)

  • Job page loads: under 200 per day

  • Company pages: under 50 per day
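
These daily limits are easier to respect when the scraper enforces them itself. Here is a minimal sketch of a per-day counter that persists to a JSON file, so restarting the script does not reset the count. The ceilings mirror the list above; the class name, file name, and category keys are illustrative choices.

```python
import json
from datetime import date
from pathlib import Path

# Ceilings taken from the list above (lower bound used for profiles)
DAILY_LIMITS = {"profile": 80, "job": 200, "company": 50}


class DailyBudget:
    """Track page loads per category and refuse requests past the daily cap."""

    def __init__(self, path="scrape_budget.json", limits=None, today=None):
        self.path = Path(path)
        self.limits = dict(limits or DAILY_LIMITS)
        self.today = (today or date.today()).isoformat()
        self.counts = self._load()

    def _load(self):
        if self.path.exists():
            saved = json.loads(self.path.read_text())
            if saved.get("date") == self.today:
                return saved.get("counts", {})
        return {}  # New day or no file yet: start from zero

    def try_spend(self, category):
        """Record one page load; return False if the cap is already reached."""
        used = self.counts.get(category, 0)
        if used >= self.limits[category]:
            return False
        self.counts[category] = used + 1
        self.path.write_text(json.dumps({"date": self.today, "counts": self.counts}))
        return True
```

In a scraping loop, a check such as "if not budget.try_spend('profile'): break" before each page.goto stops the run cleanly once the day's allowance is used up.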

Use a dedicated scraping account: Never scrape with your main personal LinkedIn account. Create a separate account specifically for scraping. If it gets banned, your real professional presence is unaffected.

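Rotate your user agent between sessions:

import random

def get_random_user_agent():
    """Pick one complete user-agent string. Call once per session, not per request."""
    webkit = "AppleWebKit/537.36 (KHTML, like Gecko)"
    agents = [
        f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) {webkit} Chrome/120.0.0.0 Safari/537.36",
        f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) {webkit} Chrome/119.0.0.0 Safari/537.36",
        f"Mozilla/5.0 (X11; Linux x86_64) {webkit} Chrome/118.0.0.0 Safari/537.36",
    ]
    return random.choice(agents)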

The Better Alternative: LinkedIn's Official API

For production use, LinkedIn's official API is more reliable and legally sound. LinkedIn offers:

  • LinkedIn Marketing API: ad data and company insights

  • Sign In with LinkedIn: OAuth-based access to basic profile data

  • LinkedIn Learning API: course and learning data

  • LinkedIn Talent Insights: paid enterprise product for HR analytics

Apply at: developer.linkedin.com

For bulk data, licensed data providers such as People Data Labs, Clearbit, and Apollo.io offer aggregated professional-profile data through clean APIs, under licensing terms that spell out permitted use.
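
As a taste of the official route, Sign In with LinkedIn follows the standard OAuth 2.0 authorization-code flow. The sketch below only builds the first redirect URL. The endpoint and the openid, profile, and email scopes are what I believe LinkedIn's OpenID Connect documentation specifies, and should be verified at developer.linkedin.com. The client ID and redirect URI in the usage line are placeholders.

```python
import secrets
from urllib.parse import urlencode

# Assumed authorization endpoint; confirm against LinkedIn's developer docs
AUTH_ENDPOINT = "https://www.linkedin.com/oauth/v2/authorization"


def build_authorization_url(client_id, redirect_uri,
                            scopes=("openid", "profile", "email"), state=None):
    """Return (url, state) for the first step of the authorization-code flow."""
    # The state value is a CSRF token: store it and compare it on the callback
    state = state or secrets.token_urlsafe(16)
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "scope": " ".join(scopes),
    })
    return f"{AUTH_ENDPOINT}?{query}", state


# Placeholder credentials for illustration only
url, state = build_authorization_url("YOUR_CLIENT_ID", "https://example.com/callback")
```

The user opens the returned URL, approves access, and LinkedIn redirects back with a code that the application exchanges for an access token in a second request, which is not shown here.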


Debugging Common LinkedIn Scraping Issues

Issue: Page loads but the data isn't there

LinkedIn lazy-loads content. Always scroll the page before extracting:

page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(2)

Issue: Redirected to login page

Your cookies have expired. Re-run save_cookies.py and log in again.
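
Rather than discovering expired cookies through empty results, the scraper can check where it landed after each navigation. A minimal sketch, assuming LinkedIn sends logged-out visitors to paths beginning with /login, /authwall, /checkpoint, or /uas/login. These prefixes are assumptions and may change.

```python
from urllib.parse import urlsplit

# Assumed path prefixes for LinkedIn's login and verification pages
AUTH_PATH_PREFIXES = ("/login", "/authwall", "/checkpoint", "/uas/login")


def landed_on_login(current_url):
    """Return True if the browser was redirected to a login or checkpoint page."""
    path = urlsplit(current_url).path
    return path.startswith(AUTH_PATH_PREFIXES)
```

With Playwright, the current address is available as page.url, so a check like "if landed_on_login(page.url): raise RuntimeError('session expired')" right after page.goto stops the run before it records a batch of empty profiles.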

Issue: "Hmm, something went wrong" screen

LinkedIn detected automation. Reduce speed, use stealth mode, and add more human-like interactions.
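
Reducing speed can be made concrete with exponential backoff: each time the error screen appears, wait longer before the next attempt. The sketch below generates such a schedule with random jitter. The base, factor, and cap values are illustrative and not tuned against LinkedIn.

```python
import random


def backoff_delays(attempts=5, base=30.0, factor=2.0, cap=900.0, rng=random):
    """Yield one delay in seconds per retry, growing geometrically up to a cap.

    Each delay is drawn uniformly between half and all of the current step,
    so retries do not fall on a fixed rhythm.
    """
    step = base
    for _ in range(attempts):
        yield rng.uniform(step / 2, step)
        step = min(step * factor, cap)
```

A retry loop would call time.sleep(delay) for each value from backoff_delays() and give up once the generator is exhausted, which is also a reasonable point to end the session for the day.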

Issue: Getting empty strings from selectors

LinkedIn frequently changes its CSS classes. Inspect the current page in DevTools and update your selectors. XPath expressions that match on part of a class name or on an attribute are sometimes more stable than exact class-based CSS selectors.

# XPath is often more robust for LinkedIn
name = page.query_selector("xpath=//h1[contains(@class, 'top-card')]")
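
Another way to soften selector churn is to keep an ordered list of candidate selectors per field and take the first one that matches. This is a minimal sketch using only query_selector and inner_text from Playwright's sync API. The helper name is hypothetical, and the selectors in the example call are illustrative rather than verified against LinkedIn's current markup.

```python
def first_text(root, selectors):
    """Try each selector in order; return the first non-empty text, else None."""
    for selector in selectors:
        element = root.query_selector(selector)
        if element is None:
            continue  # This candidate no longer matches; try the next one
        text = element.inner_text().strip()
        if text:
            return text
    return None


# Example call (selectors are illustrative):
# name = first_text(page, ["xpath=//h1[contains(@class, 'top-card')]", "h1"])
```

When the markup changes, the fix is then a one-line addition to the candidate list instead of an edit inside the extraction logic.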

Full Pipeline: Scrape, Save, and Analyse

import json
import pandas as pd
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def run_pipeline(profile_urls):
    results = []

    with sync_playwright() as p:
        browser, context = create_linkedin_context(p)
        page = create_stealthy_page(context)

        for i, url in enumerate(profile_urls):
            print(f"[{i+1}/{len(profile_urls)}] {url}")
            try:
                data = scrape_profile(page, url)
                results.append(data)
            except Exception as e:
                print(f"  Error: {e}")
                results.append({"profile_url": url, "error": str(e)})

            human_delay(4, 8)

        browser.close()

    # Save raw JSON
    with open("profiles_raw.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Flatten to DataFrame (ignoring nested lists for CSV)
    flat = []
    for r in results:
        # "or []" covers both missing keys (error records) and empty lists,
        # so indexing below cannot raise IndexError
        experience = r.get("experience") or []
        skills = r.get("skills") or []
        first_job = experience[0] if experience else {}
        flat.append({
            "name": r.get("name"),
            "headline": r.get("headline"),
            "location": r.get("location"),
            "about": r.get("about"),
            "experience_count": len(experience),
            "first_role": first_job.get("title"),
            "first_company": first_job.get("company"),
            "skills_count": len(skills),
            "top_skill": skills[0] if skills else None,
            "url": r.get("profile_url")
        })

    df = pd.DataFrame(flat)
    df.to_csv("profiles_summary.csv", index=False)
    print(f"\nSaved {len(df)} profiles.")
    print(df[["name", "headline", "first_company"]].head())
    return df

# Run it
urls = [
    "https://www.linkedin.com/in/example-profile-1/",
    "https://www.linkedin.com/in/example-profile-2/",
]
df = run_pipeline(urls)
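
The pipeline above stops at saving. For the analysis step, a natural first question is which skills appear most often across the scraped profiles. Here is a minimal sketch using only the standard library, assuming the input has the shape run_pipeline collects: a list of dicts, each with an optional skills list. The function name is illustrative.

```python
from collections import Counter


def top_skills(profiles, n=10):
    """Count skills case-insensitively across profiles; return the n most common."""
    counts = Counter()
    for profile in profiles:
        # Error records have no "skills" key, so fall back to an empty list.
        # A set ensures each profile contributes at most once per skill.
        unique = {s.strip().lower() for s in profile.get("skills") or [] if s and s.strip()}
        counts.update(unique)
    return counts.most_common(n)
```

It can be fed either the in-memory results list or the contents of profiles_raw.json loaded with json.load.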

Summary

| Topic | Key takeaway |
|---|---|
| Authentication | Never automate login; save and reuse cookies |
| Stealth | Use playwright-stealth to patch common automation signals |
| Delays | Random delays of 3–8 seconds between profiles |
| Profiles | Scroll to load, extract name/headline/experience/education |
| Jobs | Available without login, 25 per page |
| Company pages | Rich overview stats available in org-about-module |
| Account safety | Use a dedicated scraping account, stay under 100 profiles/day |
| Best alternative | LinkedIn API or licensed data providers for production |
