Which topics does this article cover?

It highlights LinkedIn, Playwright, Web Scraping, Python, Browser Automation.

LinkedIn Scraping with Python: Profiles, Jobs & Company Pages

Q: What is "LinkedIn Scraping with Python: Profiles, Jobs & Company Pages" about?

LinkedIn is one of the most valuable and difficult websites to scrape. This guide covers Playwright, session cookies, stealth techniques, profile extraction, job scraping, company data collection, rate limiting, and when to use LinkedIn's official API instead.

LinkedIn is arguably the most valuable professional database on the internet — hundreds of millions of profiles, millions of job postings, and rich company data all in one place. It's no surprise that data scientists, recruiters, and researchers want to access it programmatically.

But LinkedIn is also one of the hardest major sites to scrape. It uses:

Heavy JavaScript rendering — almost no meaningful content is in the initial HTML
Session-based authentication — most data is only visible when logged in
Aggressive bot detection — including TLS fingerprinting, behavioral analysis, and account-level rate limiting
Frequent HTML structure changes — CSS selectors break regularly as LinkedIn updates its frontend

This guide gives you a practical, working approach to scraping LinkedIn in 2025. We'll cover two methods: Playwright (recommended) and an overview of the official API as an alternative.

Legal and ethical notice: LinkedIn's Terms of Service prohibit automated scraping. In hiQ v. LinkedIn, hiQ argued that scraping publicly available profiles does not violate the Computer Fraud and Abuse Act. The Ninth Circuit agreed at the preliminary-injunction stage, but the case ended in late 2022 with a consent judgment against hiQ based on its breach of LinkedIn's User Agreement. This guide is for educational purposes. Before scraping LinkedIn for any commercial purpose, consult a lawyer. For most production use cases, LinkedIn's official API or a licensed data provider is a safer choice.

Why Playwright Over Selenium

Both Playwright and Selenium control a real browser, which is necessary for JavaScript-heavy sites. Playwright has several advantages:

Feature	Playwright	Selenium
Speed	Faster async API	Slower
Stealth	Better fingerprint control	Harder to configure
API quality	Modern, intuitive	Older, verbose
Browser support	Chromium, Firefox, WebKit	Chrome, Firefox, Edge, Safari
Auto-waiting	Built-in smart waits	Explicit waits usually needed
Community (2025)	Rapidly growing	Mature but stagnant

For LinkedIn specifically, Playwright with playwright-stealth is the current best option.

Installation

pip install playwright playwright-stealth pandas
playwright install chromium

This downloads the Chromium browser binary that Playwright will control.

Step 1: Saving Your LinkedIn Session (One-Time Setup)

The golden rule for LinkedIn scraping is: never automate the login itself. LinkedIn's login page is heavily monitored, and an automated login is likely to trigger a verification challenge or an account flag.

Instead, log in manually once in a Playwright browser, save your session cookies to a file, and reuse those cookies for all future scraping sessions.

# save_cookies.py — run this once manually
import json
from playwright.sync_api import sync_playwright

def save_linkedin_session():
    with sync_playwright() as p:
        # Launch a visible (non-headless) browser so you can log in yourself
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )

        page = context.new_page()
        page.goto("https://www.linkedin.com/login")

        # Wait for you to log in manually — you have 60 seconds
        print("Please log in to LinkedIn in the browser window...")
        print("The script will continue once you're on the home feed.")

        # Wait until the feed appears — this confirms a successful login
        page.wait_for_url("**/feed/**", timeout=60000)
        print("Login detected! Saving cookies...")

        cookies = context.cookies()
        with open("linkedin_cookies.json", "w") as f:
            json.dump(cookies, f, indent=2)

        print(f"Saved {len(cookies)} cookies to linkedin_cookies.json")
        browser.close()

save_linkedin_session()

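Run this script once, log in manually, and your cookies are saved. These cookies typically stay valid for weeks before LinkedIn requires re-authentication. Treat linkedin_cookies.json like a password and keep it out of version control: anyone who holds the file can act as your account.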
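Because the saved cookies eventually expire, it can help to check the file before starting a long run. The following is a minimal sketch, not part of the original scripts. It assumes the session cookie is named li_at (an assumption about LinkedIn, worth confirming in your own cookie file) and relies on the shape Playwright's context.cookies() returns, where expires is a Unix timestamp in seconds or -1 for a session cookie. The helper names are hypothetical.

```python
import json
import time
from pathlib import Path


def session_cookie_status(cookies, name="li_at", now=None):
    """Classify the named cookie as 'missing', 'expired', 'session', or 'valid'.

    `cookies` is a list of dicts as returned by Playwright's context.cookies().
    The default name "li_at" is an assumption about LinkedIn's session cookie.
    """
    now = time.time() if now is None else now
    for cookie in cookies:
        if cookie.get("name") != name:
            continue
        expires = cookie.get("expires", -1)
        if expires == -1:
            return "session"  # no expiry recorded, so validity is unknown
        return "valid" if expires > now else "expired"
    return "missing"


def load_cookies(path="linkedin_cookies.json"):
    """Read the cookie list written by save_cookies.py."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A scraper could call session_cookie_status(load_cookies()) at startup and stop early on "missing" or "expired". A "valid" result only means the expiry date has not passed; LinkedIn can still invalidate the session on its side.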

Step 2: Loading Your Session in Future Scrapes

import json
from playwright.sync_api import sync_playwright

def create_linkedin_context(playwright):
    """Create a browser context pre-loaded with your saved LinkedIn cookies."""
    browser = playwright.chromium.launch(
        headless=True,  # Can run headless once cookies are saved
        args=[
            "--no-sandbox",
            "--disable-blink-features=AutomationControlled",
        ]
    )

    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 900},
        locale="en-US",
        timezone_id="America/New_York"
    )

    # Load saved cookies
    with open("linkedin_cookies.json") as f:
        cookies = json.load(f)
    context.add_cookies(cookies)

    return browser, context

Step 3: Applying Stealth Mode

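Without stealth patches, Playwright exposes a number of signals, such as navigator.webdriver being true, that LinkedIn can use to identify it as a bot. The playwright-stealth package patches the most important ones. The stealth_sync import below matches the package's 1.x releases; if the import fails on a newer release, check that release's README for the current entry point: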

from playwright_stealth import stealth_sync

def create_stealthy_page(context):
    """Create a page with stealth patches applied."""
    page = context.new_page()

    # Apply stealth — patches navigator.webdriver, plugins, languages, etc.
    stealth_sync(page)

    return page

The main signals the stealth patches cover:

navigator.webdriver → set to undefined (normally true in automation)
navigator.plugins → adds realistic fake plugins
navigator.languages → set to ["en-US", "en"]
window.chrome → adds the Chrome runtime object
WebGL vendor and renderer → reports ordinary desktop GPU strings instead of the headless defaults
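To make the list above less abstract, here is a simplified illustration of how a patch of this kind can work: a small piece of JavaScript runs before any page script and redefines a navigator property. This is a sketch of the general technique, not the code playwright-stealth ships, and build_init_script is a hypothetical helper.

```python
import json


def build_init_script(languages=("en-US", "en")):
    """Return JavaScript that overrides two navigator properties.

    Simplified illustration only; real stealth patches cover more cases.
    """
    langs = json.dumps(list(languages))
    return (
        "Object.defineProperty(navigator, 'webdriver', { get: () => undefined });\n"
        f"Object.defineProperty(navigator, 'languages', {{ get: () => {langs} }});\n"
    )
```

With Playwright, the returned string could be registered with context.add_init_script(build_init_script()) before any page is opened, so the override is in place when the site's own scripts run.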

Scraping LinkedIn Profiles

A LinkedIn profile URL looks like: https://www.linkedin.com/in/username

import time
import random
import json
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def human_delay(min_sec=1.5, max_sec=4.0):
    """Random delay to mimic human browsing behavior."""
    time.sleep(random.uniform(min_sec, max_sec))

def scroll_to_load(page, scrolls=5):
    """Scroll down the page to trigger lazy-loaded sections."""
    for _ in range(scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight * 0.8)")
        time.sleep(random.uniform(0.8, 1.5))

def scrape_profile(page, profile_url):
    """Extract structured data from a LinkedIn profile page."""
    page.goto(profile_url, wait_until="domcontentloaded")
    human_delay(2, 3.5)

    # Scroll to load experience, education, and skills sections
    scroll_to_load(page, scrolls=6)
    human_delay(1, 2)

    data = {}

    # --- Basic Info ---
    try:
        data["name"] = page.query_selector("h1").inner_text().strip()
    except Exception:  # selector missed; unlike a bare except, Ctrl+C still works
        data["name"] = None

    try:
        data["headline"] = page.query_selector(
            ".text-body-medium.break-words"
        ).inner_text().strip()
    except Exception:
        data["headline"] = None

    try:
        data["location"] = page.query_selector(
            ".text-body-small.inline.t-black--light.break-words"
        ).inner_text().strip()
    except Exception:
        data["location"] = None

    # --- About Section ---
    try:
        about_el = page.query_selector("#about ~ div .full-width")
        data["about"] = about_el.inner_text().strip() if about_el else None
    except Exception:
        data["about"] = None

    # --- Experience ---
    data["experience"] = []
    try:
        exp_items = page.query_selector_all(
            "#experience ~ div li.artdeco-list__item"
        )
        for item in exp_items:
            title_el = item.query_selector(".t-bold span")
            company_el = item.query_selector(".t-14.t-normal span")
            duration_el = item.query_selector(".t-14.t-normal.t-black--light span")

            data["experience"].append({
                "title": title_el.inner_text().strip() if title_el else None,
                "company": company_el.inner_text().strip() if company_el else None,
                "duration": duration_el.inner_text().strip() if duration_el else None,
            })
    except Exception:
        pass

    # --- Education ---
    data["education"] = []
    try:
        edu_items = page.query_selector_all(
            "#education ~ div li.artdeco-list__item"
        )
        for item in edu_items:
            school_el = item.query_selector(".t-bold span")
            degree_el = item.query_selector(".t-14.t-normal span")

            data["education"].append({
                "school": school_el.inner_text().strip() if school_el else None,
                "degree": degree_el.inner_text().strip() if degree_el else None,
            })
    except Exception:
        pass

    # --- Skills ---
    data["skills"] = []
    try:
        skill_els = page.query_selector_all(
            "#skills ~ div .t-bold span[aria-hidden='true']"
        )
        data["skills"] = [el.inner_text().strip() for el in skill_els[:20]]
    except Exception:
        pass

    data["profile_url"] = profile_url
    return data


# Main execution
def main():
    profile_urls = [
        "https://www.linkedin.com/in/some-public-profile/",
        # Add more profile URLs here
    ]

    results = []

    with sync_playwright() as p:
        browser, context = create_linkedin_context(p)
        page = create_stealthy_page(context)

        for i, url in enumerate(profile_urls):
            print(f"[{i+1}/{len(profile_urls)}] Scraping: {url}")
            profile_data = scrape_profile(page, url)
            results.append(profile_data)
            human_delay(3, 6)  # longer delay between profiles

        browser.close()

    # Save results
    with open("profiles.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print(f"Saved {len(results)} profiles to profiles.json")

if __name__ == "__main__":
    main()
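The profile scraper repeats the same "find element, read text, fall back to None" pattern many times, and each field depends on a single selector. One way to reduce both the repetition and the breakage is a helper that tries several selectors in order. This is a sketch under the assumption that the object passed in exposes query_selector(), as Playwright's Page and ElementHandle do; first_text is a hypothetical name.

```python
def first_text(root, selectors):
    """Return the stripped text of the first selector that matches, else None.

    `root` is anything exposing query_selector(), such as a Playwright
    Page or ElementHandle. Selectors that match nothing or yield empty
    text are skipped.
    """
    for selector in selectors:
        element = root.query_selector(selector)
        if element is None:
            continue
        text = element.inner_text().strip()
        if text:
            return text
    return None
```

For example, data["name"] = first_text(page, ["h1"]) replaces a whole try/except block, and a second selector can be appended to the list when LinkedIn changes its markup, without deleting the old one.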

Scraping LinkedIn Job Listings

Job listings are slightly easier to scrape than profiles because they are available without being logged in (for most searches). Here is a dedicated job scraper:

import time
import random
from urllib.parse import quote

import pandas as pd
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def scrape_linkedin_jobs(keywords, location, num_pages=5):
    """
    Scrape LinkedIn job listings for given keywords and location.

    Args:
        keywords: Job search term, e.g. "python developer"
        location: Location string, e.g. "India"
        num_pages: How many pages to scrape (25 jobs per page)
    """
    # quote() percent-encodes spaces and reserved characters such as "&" or "+"
    base_url = (
        "https://www.linkedin.com/jobs/search/"
        f"?keywords={quote(keywords, safe='')}"
        f"&location={quote(location, safe='')}"
        "&start={}"
    )

    all_jobs = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
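        context = browser.new_context(
            # Use a complete user-agent string; a truncated one is itself a bot signal
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"
        )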
        page = context.new_page()
        stealth_sync(page)

        for page_num in range(num_pages):
            start = page_num * 25
            url = base_url.format(start)
            print(f"Scraping page {page_num + 1}: {url}")

            page.goto(url, wait_until="domcontentloaded")
            time.sleep(random.uniform(2, 4))

            # Scroll to load all job cards
            for _ in range(3):
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                time.sleep(1)

            job_cards = page.query_selector_all(".jobs-search__results-list li")

            for card in job_cards:
                try:
                    title = card.query_selector(
                        ".base-search-card__title"
                    ).inner_text().strip()
                    company = card.query_selector(
                        ".base-search-card__subtitle"
                    ).inner_text().strip()
                    location_el = card.query_selector(
                        ".job-search-card__location"
                    )
                    job_location = location_el.inner_text().strip() if location_el else ""
                    link = card.query_selector("a.base-card__full-link")
                    job_url = link.get_attribute("href") if link else ""
                    date_el = card.query_selector("time")
                    posted_date = date_el.get_attribute("datetime") if date_el else ""

                    all_jobs.append({
                        "title": title,
                        "company": company,
                        "location": job_location,
                        "posted_date": posted_date,
                        "url": job_url
                    })
                except Exception as e:
                    continue  # Skip malformed cards

            time.sleep(random.uniform(2, 5))

        browser.close()

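    # Passing columns keeps the frame usable even when no cards were found
    columns = ["title", "company", "location", "posted_date", "url"]
    df = pd.DataFrame(all_jobs, columns=columns).drop_duplicates(subset=["url"])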
    return df


# Run
df = scrape_linkedin_jobs("python developer", "India", num_pages=4)
df.to_csv("linkedin_jobs.csv", index=False)
print(f"Scraped {len(df)} unique job listings")
print(df[["title", "company", "location"]].head(10))
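Dropping duplicates by URL only works when two cards for the same posting carry identical links. A minimal sketch, assuming the hrefs may differ only in query-string parameters or a trailing slash (an assumption about LinkedIn's markup that the scraper above does not verify); canonical_job_url is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_job_url(url):
    """Drop the query string, fragment, and trailing slash from a job URL."""
    if not url:
        return ""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Applied as df["url"] = df["url"].map(canonical_job_url) before drop_duplicates, this makes the comparison depend on the posting's path alone.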

Scraping Company Pages

Company pages contain employee count, industry, headquarters location, and a "People" section showing employees.

import random
import time

def scrape_company_page(page, company_url):
    """Extract data from a LinkedIn company page."""
    page.goto(company_url, wait_until="domcontentloaded")
    time.sleep(random.uniform(2, 4))

    # Scroll to load all sections
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)

    data = {"url": company_url}

    # Company name
    try:
        data["name"] = page.query_selector("h1").inner_text().strip()
    except Exception:
        data["name"] = None

    # Tagline / description
    try:
        data["tagline"] = page.query_selector(
            ".org-top-card-summary__tagline"
        ).inner_text().strip()
    except Exception:
        data["tagline"] = None

    # Overview stats (employees, industry, HQ, type)
    data["overview"] = {}
    try:
        overview_items = page.query_selector_all(
            ".org-about-module__margin-bottom"
        )
        for item in overview_items:
            label_el = item.query_selector("dt")
            value_el = item.query_selector("dd")
            if label_el and value_el:
                label = label_el.inner_text().strip()
                value = value_el.inner_text().strip()
                data["overview"][label] = value
    except Exception:
        pass

    # Number of followers
    try:
        followers_el = page.query_selector(
            ".org-top-card-summary-info-list__info-item"
        )
        data["followers"] = followers_el.inner_text().strip() if followers_el else None
    except Exception:
        data["followers"] = None

    return data
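The follower and employee values come back as display strings, which are awkward to sort or aggregate. A small sketch of one way to convert them, assuming formats such as "12,345 followers" or "1.2M followers" (the exact strings LinkedIn renders are an assumption here); parse_count is a hypothetical helper.

```python
import re
from decimal import Decimal

_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Parse strings such as '12,345 followers' or '1.2M followers' to an int.

    Returns None when the text is empty or contains no number.
    """
    if not text:
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([KMB])?\b", text, re.IGNORECASE)
    if match is None:
        return None
    # Decimal avoids float rounding surprises for values like 1.2M
    number = Decimal(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    return int(number * _SUFFIX[suffix])
```

In the company scraper this could be applied as data["followers_count"] = parse_count(data["followers"]), keeping the raw string alongside the parsed value in case the format changes.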

Rate Limiting and Account Safety

LinkedIn can permanently ban accounts that scrape aggressively. Follow these guidelines to minimize risk:

Delays between requests:

# Conservative — for accounts you care about
time.sleep(random.uniform(5, 12))

# Moderate
time.sleep(random.uniform(2, 5))

# Risky — only for throwaway accounts
time.sleep(random.uniform(0.5, 1.5))

Rough daily ceilings (rules of thumb, not limits LinkedIn publishes):

Profile views: under 80–100 per day (LinkedIn counts these)
Job page loads: under 200 per day
Company pages: under 50 per day
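These ceilings are easier to respect when the scraper enforces them itself. A minimal sketch, assuming a single long-running process; DailyBudget and its category names are hypothetical, and the counters live in memory only, so they reset when the script restarts.

```python
from collections import Counter
from datetime import date


class DailyBudget:
    """Count page loads per category and refuse once a daily cap is hit."""

    def __init__(self, limits, today=date.today):
        self.limits = dict(limits)
        self._today = today      # injectable clock, which makes testing easy
        self._day = today()
        self._used = Counter()

    def try_spend(self, category):
        """Record one page load; return False if the cap is already reached."""
        current = self._today()
        if current != self._day:  # new calendar day: reset the counters
            self._day = current
            self._used.clear()
        if self._used[category] >= self.limits[category]:
            return False
        self._used[category] += 1
        return True
```

A run could create budget = DailyBudget({"profile": 80, "job": 200, "company": 50}) and call budget.try_spend("profile") before each page.goto, stopping for the day when it returns False.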

Use a dedicated scraping account: Never scrape with your main personal LinkedIn account. Create a separate account specifically for scraping. If it gets banned, your real professional presence is unaffected.

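Vary the user agent between accounts, not within a session. A cookie session that suddenly reports a different browser stands out, so pick one string per account and reuse it with that account's cookies:

import random

def get_random_user_agent():
    """Pick a user agent once per account, then keep it for that session."""
    agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
    ]
    return random.choice(agents)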

The Better Alternative: LinkedIn's Official API

For production use, LinkedIn's official API is more reliable and legally sound. LinkedIn offers:

LinkedIn Marketing API: ad data and company insights
Sign In with LinkedIn: OAuth-based access to the basic profile of the member who signs in
LinkedIn Learning API: course and learning data
LinkedIn Talent Insights: paid enterprise product for HR analytics

Apply at: developer.linkedin.com
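Sign In with LinkedIn follows the standard OAuth 2.0 authorization-code flow. The sketch below covers only the first step, building the URL the user is sent to. The endpoint and scope names are believed to match LinkedIn's developer documentation, but confirm them at developer.linkedin.com before use. The client ID and redirect URI are placeholders, and build_authorization_url is a hypothetical helper.

```python
import secrets
from urllib.parse import urlencode

# Assumed endpoint; verify against LinkedIn's current developer documentation.
AUTHORIZATION_ENDPOINT = "https://www.linkedin.com/oauth/v2/authorization"


def build_authorization_url(client_id, redirect_uri,
                            scopes=("openid", "profile", "email"), state=None):
    """Return (url, state) for the first step of the authorization-code flow."""
    # A random state value lets the callback handler reject forged responses
    state = state or secrets.token_urlsafe(16)
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "scope": " ".join(scopes),
    })
    return f"{AUTHORIZATION_ENDPOINT}?{query}", state
```

The returned state should be stored (for example in the user's session) and compared with the state parameter LinkedIn sends back to the redirect URI. This flow only returns data for the member who signs in, so it is not a substitute for bulk profile collection.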

For bulk data, commercial data providers such as People Data Labs, Clearbit, and Apollo.io sell professional-profile datasets through clean APIs. Review each provider's sourcing and licensing terms before relying on it.

Debugging Common LinkedIn Scraping Issues

Issue: Page loads but the data isn't there

LinkedIn lazy-loads content. Always scroll the page before extracting:

page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(2)

Issue: Redirected to login page

Your cookies have expired. Re-run save_cookies.py and log in again.
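A scraper can detect this redirect itself instead of silently saving empty records. A minimal sketch: the path prefixes are assumptions about where LinkedIn sends unauthenticated visitors and should be adjusted to what you observe, and is_auth_redirect is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Assumed path prefixes for LinkedIn's login, auth-wall, and challenge pages.
AUTH_PATH_PREFIXES = ("/login", "/uas/login", "/authwall", "/checkpoint")


def is_auth_redirect(url):
    """Return True when the URL's path looks like a login or challenge page."""
    return urlsplit(url).path.startswith(AUTH_PATH_PREFIXES)
```

After each page.goto(...), checking is_auth_redirect(page.url) and stopping the run when it returns True avoids hammering the site with a dead session.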

Issue: "Hmm, something went wrong" screen

LinkedIn detected automation. Reduce speed, use stealth mode, and add more human-like interactions.
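"Reduce speed" can be made concrete with exponential backoff: each consecutive failure doubles the wait, up to a cap, with random jitter so the retries do not fall on a fixed rhythm. This is a sketch with illustrative numbers, not tuned values; backoff_delay is a hypothetical helper.

```python
import random


def backoff_delay(attempt, base=5.0, cap=300.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the result is drawn
    uniformly from the upper half of that range.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(ceiling / 2, ceiling)
```

In a retry loop this would be used as time.sleep(backoff_delay(attempt)), resetting attempt to 0 after a page loads normally.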

Issue: Getting None or empty strings from selectors

LinkedIn frequently changes its CSS classes. Inspect the current page in DevTools and update your selectors. Selectors that match on structure or on part of a class name tend to survive longer than exact class chains, and XPath makes such partial matches easy to write.

# A partial class match keeps working when only part of the class name changes
name = page.query_selector("xpath=//h1[contains(@class, 'top-card')]")

Full Pipeline: Scrape, Save, and Analyse

import json
import pandas as pd
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def run_pipeline(profile_urls):
    results = []

    with sync_playwright() as p:
        browser, context = create_linkedin_context(p)
        page = create_stealthy_page(context)

        for i, url in enumerate(profile_urls):
            print(f"[{i+1}/{len(profile_urls)}] {url}")
            try:
                data = scrape_profile(page, url)
                results.append(data)
            except Exception as e:
                print(f"  Error: {e}")
                results.append({"profile_url": url, "error": str(e)})

            human_delay(4, 8)

        browser.close()

    # Save raw JSON
    with open("profiles_raw.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Flatten to DataFrame (ignoring nested lists for CSV)
    flat = []
    for r in results:
        experience = r.get("experience") or []
        skills = r.get("skills") or []
        latest = experience[0] if experience else {}
        flat.append({
            "name": r.get("name"),
            "headline": r.get("headline"),
            "location": r.get("location"),
            "about": r.get("about"),
            "experience_count": len(experience),
            "first_role": latest.get("title"),
            "first_company": latest.get("company"),
            "skills_count": len(skills),
            # Guard against an empty skills list, which would raise IndexError
            "top_skill": skills[0] if skills else None,
            "url": r.get("profile_url")
        })

    df = pd.DataFrame(flat)
    df.to_csv("profiles_summary.csv", index=False)
    print(f"\nSaved {len(df)} profiles.")
    print(df[["name", "headline", "first_company"]].head())
    return df

# Run it
urls = [
    "https://www.linkedin.com/in/example-profile-1/",
    "https://www.linkedin.com/in/example-profile-2/",
]
df = run_pipeline(urls)
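Long runs get interrupted, and re-scraping profiles that already succeeded wastes the daily budget. A minimal sketch of resume support, assuming the record shape that run_pipeline writes (a profile_url key, plus an error key on failures); pending_urls is a hypothetical helper.

```python
import json
from pathlib import Path


def pending_urls(all_urls, results_path="profiles_raw.json"):
    """Return the URLs with no successful record in an earlier results file."""
    path = Path(results_path)
    if not path.exists():
        return list(all_urls)
    records = json.loads(path.read_text(encoding="utf-8"))
    # Records carrying an "error" key failed earlier and should be retried
    done = {r.get("profile_url") for r in records if "error" not in r}
    return [url for url in all_urls if url not in done]
```

Calling run_pipeline(pending_urls(urls)) skips finished profiles. Note that run_pipeline as written overwrites profiles_raw.json, so the earlier records need to be merged with the new ones before saving if the file is meant to accumulate.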

Summary

Topic	Key takeaway
Authentication	Never automate login — save and reuse cookies
Stealth	Use `playwright-stealth` to patch common automation signals
Delays	Random delays of 3–8 seconds between profiles
Profiles	Scroll to load, extract name/headline/experience/education
Jobs	Available without login, 25 per page
Company pages	Rich overview stats available in `org-about-module`
Account safety	Use a dedicated scraping account, stay under 100 profiles/day
Best alternative	LinkedIn API or licensed data providers for production

But LinkedIn is also the hardest major site to scrape. It uses:

Heavy JavaScript rendering — almost no meaningful content is in the initial HTML
Session-based authentication — most data is only visible when logged in
Aggressive bot detection — including TLS fingerprinting, behavioral analysis, and account-level rate limiting
Frequent HTML structure changes — CSS selectors break regularly as LinkedIn updates its frontend

This guide gives you a practical, working approach to scraping LinkedIn in 2025. We'll cover two methods: Playwright (recommended) and an overview of the official API as an alternative.

Legal and ethical notice: LinkedIn's Terms of Service prohibit automated scraping. The hiQ v. LinkedIn lawsuit (which argued that publicly available data could be scraped under the Computer Fraud and Abuse Act) has had complex, ongoing outcomes. This guide is for educational purposes. Before scraping LinkedIn for any commercial purpose, consult a lawyer. For most production use cases, LinkedIn's official API or a licensed data provider is a safer choice.

Why Playwright Over Selenium

Both Playwright and Selenium control a real browser, which is necessary for JavaScript-heavy sites. Playwright has several advantages:

Feature	Playwright	Selenium
Speed	Faster async API	Slower
Stealth	Better fingerprint control	Harder to configure
API quality	Modern, intuitive	Older, verbose
Browser support	Chromium, Firefox, WebKit	Chrome, Firefox, Edge
Auto-waiting	Built-in smart waits	Manual waits required
Community (2025)	Rapidly growing	Mature but stagnant

For LinkedIn specifically, Playwright with playwright-stealth is the current best option.

Installation

pip install playwright playwright-stealth pandas
playwright install chromium

This downloads the Chromium browser binary that Playwright will control.

Step 1: Saving Your LinkedIn Session (One-Time Setup)

The golden rule for LinkedIn scraping is: never automate the login itself. LinkedIn's login page is heavily monitored and a bot logging in triggers immediate account flags.

Instead, log in manually once in a Playwright browser, save your session cookies to a file, and reuse those cookies for all future scraping sessions.

# save_cookies.py — run this once manually
import json
from playwright.sync_api import sync_playwright

def save_linkedin_session():
    with sync_playwright() as p:
        # Launch a visible (non-headless) browser so you can log in yourself
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )

        page = context.new_page()
        page.goto("https://www.linkedin.com/login")

        # Wait for you to log in manually — you have 60 seconds
        print("Please log in to LinkedIn in the browser window...")
        print("The script will continue once you're on the home feed.")

        # Wait until the feed appears — this confirms a successful login
        page.wait_for_url("**/feed/**", timeout=60000)
        print("Login detected! Saving cookies...")

        cookies = context.cookies()
        with open("linkedin_cookies.json", "w") as f:
            json.dump(cookies, f, indent=2)

        print(f"Saved {len(cookies)} cookies to linkedin_cookies.json")
        browser.close()

save_linkedin_session()

Run this script once, log in manually, and your cookies are saved. These cookies typically stay valid for weeks before LinkedIn requires re-authentication.

Step 2: Loading Your Session in Future Scrapes

import json
from playwright.sync_api import sync_playwright

def create_linkedin_context(playwright):
    """Create a browser context pre-loaded with your saved LinkedIn cookies."""
    browser = playwright.chromium.launch(
        headless=True,  # Can run headless once cookies are saved
        args=[
            "--no-sandbox",
            "--disable-blink-features=AutomationControlled",
        ]
    )

    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 900},
        locale="en-US",
        timezone_id="America/New_York"
    )

    # Load saved cookies
    with open("linkedin_cookies.json") as f:
        cookies = json.load(f)
    context.add_cookies(cookies)

    return browser, context

Step 3: Applying Stealth Mode

Without stealth patches, Playwright exposes dozens of signals that LinkedIn can use to identify it as a bot. The playwright-stealth package patches the most important ones:

from playwright_stealth import stealth_sync

def create_stealthy_page(context):
    """Create a page with stealth patches applied."""
    page = context.new_page()

    # Apply stealth — patches navigator.webdriver, plugins, languages, etc.
    stealth_sync(page)

    return page

The key things stealth patches:

navigator.webdriver → set to undefined (normally true in automation)
navigator.plugins → adds realistic fake plugins
navigator.languages → set to ["en-US", "en"]
window.chrome → adds the Chrome runtime object
Canvas fingerprinting → adds slight noise to prevent fingerprint matching

Scraping LinkedIn Profiles

A LinkedIn profile URL looks like: https://www.linkedin.com/in/username

import time
import random
import json
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def human_delay(min_sec=1.5, max_sec=4.0):
    """Random delay to mimic human browsing behavior."""
    time.sleep(random.uniform(min_sec, max_sec))

def scroll_to_load(page, scrolls=5):
    """Scroll down the page to trigger lazy-loaded sections."""
    for _ in range(scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight * 0.8)")
        time.sleep(random.uniform(0.8, 1.5))

def scrape_profile(page, profile_url):
    """Extract structured data from a LinkedIn profile page."""
    page.goto(profile_url, wait_until="domcontentloaded")
    human_delay(2, 3.5)

    # Scroll to load experience, education, and skills sections
    scroll_to_load(page, scrolls=6)
    human_delay(1, 2)

    data = {}

    # --- Basic Info ---
    try:
        data["name"] = page.query_selector("h1").inner_text().strip()
    except:
        data["name"] = None

    try:
        data["headline"] = page.query_selector(
            ".text-body-medium.break-words"
        ).inner_text().strip()
    except:
        data["headline"] = None

    try:
        data["location"] = page.query_selector(
            ".text-body-small.inline.t-black--light.break-words"
        ).inner_text().strip()
    except:
        data["location"] = None

    # --- About Section ---
    try:
        about_el = page.query_selector("#about ~ div .full-width")
        data["about"] = about_el.inner_text().strip() if about_el else None
    except:
        data["about"] = None

    # --- Experience ---
    data["experience"] = []
    try:
        exp_items = page.query_selector_all(
            "#experience ~ div li.artdeco-list__item"
        )
        for item in exp_items:
            title_el = item.query_selector(".t-bold span")
            company_el = item.query_selector(".t-14.t-normal span")
            duration_el = item.query_selector(".t-14.t-normal.t-black--light span")

            data["experience"].append({
                "title": title_el.inner_text().strip() if title_el else None,
                "company": company_el.inner_text().strip() if company_el else None,
                "duration": duration_el.inner_text().strip() if duration_el else None,
            })
    except:
        pass

    # --- Education ---
    data["education"] = []
    try:
        edu_items = page.query_selector_all(
            "#education ~ div li.artdeco-list__item"
        )
        for item in edu_items:
            school_el = item.query_selector(".t-bold span")
            degree_el = item.query_selector(".t-14.t-normal span")

            data["education"].append({
                "school": school_el.inner_text().strip() if school_el else None,
                "degree": degree_el.inner_text().strip() if degree_el else None,
            })
    except:
        pass

    # --- Skills ---
    data["skills"] = []
    try:
        skill_els = page.query_selector_all(
            "#skills ~ div .t-bold span[aria-hidden='true']"
        )
        data["skills"] = [el.inner_text().strip() for el in skill_els[:20]]
    except:
        pass

    data["profile_url"] = profile_url
    return data


# Main execution
def main():
    profile_urls = [
        "https://www.linkedin.com/in/some-public-profile/",
        # Add more profile URLs here
    ]

    results = []

    with sync_playwright() as p:
        browser, context = create_linkedin_context(p)
        page = create_stealthy_page(context)

        for i, url in enumerate(profile_urls):
            print(f"[{i+1}/{len(profile_urls)}] Scraping: {url}")
            profile_data = scrape_profile(page, url)
            results.append(profile_data)
            human_delay(3, 6)  # longer delay between profiles

        browser.close()

    # Save results
    with open("profiles.json", "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print(f"Saved {len(results)} profiles to profiles.json")

if __name__ == "__main__":
    main()

Scraping LinkedIn Job Listings

Job listings are slightly easier to scrape than profiles because they are available without being logged in (for most searches). Here is a dedicated job scraper:

import time
import random
import pandas as pd
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def scrape_linkedin_jobs(keywords, location, num_pages=5):
    """
    Scrape LinkedIn job listings for given keywords and location.

    Args:
        keywords: Job search term, e.g. "python developer"
        location: Location string, e.g. "India"
        num_pages: How many pages to scrape (25 jobs per page)
    """
    base_url = (
        "https://www.linkedin.com/jobs/search/"
        f"?keywords={keywords.replace(' ', '%20')}"
        f"&location={location.replace(' ', '%20')}"
        f"&start={{}}"
    )

    all_jobs = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120"
        )
        page = context.new_page()
        stealth_sync(page)

        for page_num in range(num_pages):
            start = page_num * 25
            url = base_url.format(start)
            print(f"Scraping page {page_num + 1}: {url}")

            page.goto(url, wait_until="domcontentloaded")
            time.sleep(random.uniform(2, 4))

            # Scroll to load all job cards
            for _ in range(3):
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                time.sleep(1)

            job_cards = page.query_selector_all(".jobs-search__results-list li")

            for card in job_cards:
                try:
                    title = card.query_selector(
                        ".base-search-card__title"
                    ).inner_text().strip()
                    company = card.query_selector(
                        ".base-search-card__subtitle"
                    ).inner_text().strip()
                    location_el = card.query_selector(
                        ".job-search-card__location"
                    )
                    job_location = location_el.inner_text().strip() if location_el else ""
                    link = card.query_selector("a.base-card__full-link")
                    job_url = link.get_attribute("href") if link else ""
                    date_el = card.query_selector("time")
                    posted_date = date_el.get_attribute("datetime") if date_el else ""

                    all_jobs.append({
                        "title": title,
                        "company": company,
                        "location": job_location,
                        "posted_date": posted_date,
                        "url": job_url
                    })
                except Exception:
                    # A required element (title or company) was missing: skip this card
                    continue

            time.sleep(random.uniform(2, 5))

        browser.close()

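    # Fixed columns keep the DataFrame usable even when no job cards were found;
    # without them, drop_duplicates(subset=["url"]) raises KeyError on an empty result
    columns = ["title", "company", "location", "posted_date", "url"]
    df = pd.DataFrame(all_jobs, columns=columns).drop_duplicates(subset=["url"])
    return df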


# Run
df = scrape_linkedin_jobs("python developer", "India", num_pages=4)
df.to_csv("linkedin_jobs.csv", index=False)
print(f"Scraped {len(df)} unique job listings")
print(df[["title", "company", "location"]].head(10))
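Deduplicating on the raw `url` column can miss repeats if the same posting appears with different query strings. The sketch below assumes that job links carry tracking parameters after a `?` and that the path alone identifies a posting; both are assumptions about LinkedIn's current markup, and `canonical_job_url` and `dedupe_jobs` are hypothetical helper names, not part of the scraper above.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_job_url(url):
    """Drop the query string and fragment so the same posting compares equal."""
    if not url:
        return ""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))

def dedupe_jobs(jobs):
    """Keep the first record per canonical URL, preserving the original order."""
    seen = set()
    unique = []
    for job in jobs:
        key = canonical_job_url(job.get("url", ""))
        if key and key in seen:
            continue  # same posting seen earlier with different tracking parameters
        if key:
            seen.add(key)
        unique.append({**job, "url": key})
    return unique
```

One way to use it would be to call `dedupe_jobs(all_jobs)` before building the DataFrame, so the CSV stores the shorter canonical links.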

Scraping Company Pages

Company pages contain employee count, industry, headquarters location, and a "People" section showing employees.

def scrape_company_page(page, company_url):
    """Extract data from a LinkedIn company page."""
    page.goto(company_url, wait_until="domcontentloaded")
    time.sleep(random.uniform(2, 4))

    # Scroll to load all sections
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)

    data = {"url": company_url}

    def text_or_none(selector):
        """Return the stripped text of the first match, or None if it is missing."""
        try:
            return page.query_selector(selector).inner_text().strip()
        except Exception:  # no match (None has no inner_text) or a Playwright error
            return None

    # Company name
    data["name"] = text_or_none("h1")

    # Tagline / description
    data["tagline"] = text_or_none(".org-top-card-summary__tagline")

    # Overview stats (employees, industry, HQ, type)
    data["overview"] = {}
    try:
        overview_items = page.query_selector_all(
            ".org-about-module__margin-bottom"
        )
        for item in overview_items:
            label_el = item.query_selector("dt")
            value_el = item.query_selector("dd")
            if label_el and value_el:
                label = label_el.inner_text().strip()
                value = value_el.inner_text().strip()
                data["overview"][label] = value
    except Exception:
        pass  # Keep whatever was collected rather than fail the whole page

    # Number of followers
    data["followers"] = text_or_none(
        ".org-top-card-summary-info-list__info-item"
    )

    return data
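The overview values come back as display strings, so the employee count usually needs parsing before analysis. This is a minimal sketch that assumes the size is rendered in a form like "1,001-5,000 employees" or "10,001+ employees"; that format, the `"Company size"` label in the usage note, and the `parse_employee_range` helper are all assumptions for illustration.

```python
import re

def parse_employee_range(text):
    """Return (low, high) as ints; high is None when no upper bound is given."""
    if not text:
        return None
    # Grab runs of digits with optional thousands separators, e.g. "1,001"
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if not numbers:
        return None
    low = numbers[0]
    high = numbers[1] if len(numbers) > 1 else None
    return (low, high)
```

With the dictionary returned by `scrape_company_page`, a call might look like `parse_employee_range(data["overview"].get("Company size"))`, where the key depends on the label text LinkedIn actually shows.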

Rate Limiting and Account Safety

LinkedIn can permanently ban accounts that scrape aggressively. Follow these guidelines to minimize risk:

Delays between requests:

# Pick ONE tier per scraper; these are alternatives, not a sequence to run together

# Conservative: for accounts you care about
time.sleep(random.uniform(5, 12))

# Moderate
time.sleep(random.uniform(2, 5))

# Risky: only for throwaway accounts
time.sleep(random.uniform(0.5, 1.5))

Daily limits to stay safe:

Profile views: no more than 80–100 per day (LinkedIn counts these)
Job page loads: under 200 per day
Company pages: under 50 per day
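The limits above are easier to respect when the scraper enforces them itself. The sketch below is one possible in-memory counter; the `DailyBudget` class and the category names are hypothetical, and a real run spread over several processes would need to persist the counts to disk.

```python
from datetime import date

class DailyBudget:
    """Count page loads per category and refuse once today's cap is reached."""

    def __init__(self, limits):
        self.limits = dict(limits)   # e.g. {"profile": 80, "job": 200, "company": 50}
        self.day = date.today()
        self.counts = {}

    def try_spend(self, kind, today=None):
        """Return True and record one load, or False if the cap is used up."""
        today = today or date.today()
        if today != self.day:        # a new day started: reset every counter
            self.day = today
            self.counts = {}
        used = self.counts.get(kind, 0)
        if used >= self.limits[kind]:
            return False
        self.counts[kind] = used + 1
        return True
```

A scraping loop could call `budget.try_spend("profile")` before each `page.goto(...)` and stop for the day when it returns `False`.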

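Rotate your user agent between sessions:

import random

def get_random_user_agent():
    # Full Chrome user-agent strings; shortened ones are easy to flag
    agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
    ]
    return random.choice(agents)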

The Better Alternative: LinkedIn's Official API

For production use, LinkedIn's official API is more reliable and legally sound. LinkedIn offers:

LinkedIn Marketing API: for ad data and company insights
Sign In with LinkedIn: OAuth-based profile data access
LinkedIn Learning API: for course and learning data
LinkedIn Talent Insights: paid enterprise product for HR analytics

Apply at: developer.linkedin.com

For bulk data, licensed data providers like People Data Labs, Clearbit, and Apollo.io aggregate LinkedIn data legally and expose it via clean APIs.

Debugging Common LinkedIn Scraping Issues

Issue: Page loads but the data isn't there

LinkedIn lazy-loads content. Always scroll the page before extracting:

page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(2)

Issue: Redirected to login page

Your cookies have expired. Re-run save_cookies.py and log in again.
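A scraper can catch this case early by checking where it landed after each navigation. The sketch below assumes that LinkedIn's login and verification walls live under paths such as `/login`, `/authwall`, and `/checkpoint`; that list is an assumption and may need adjusting, and `is_login_redirect` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Path prefixes assumed to indicate a login or verification wall
LOGIN_PATH_PREFIXES = ("/login", "/uas/login", "/authwall", "/checkpoint")

def is_login_redirect(current_url):
    """Return True if the URL's path looks like a login or challenge page."""
    path = urlsplit(current_url).path
    return path.startswith(LOGIN_PATH_PREFIXES)
```

After `page.goto(url)`, checking `is_login_redirect(page.url)` and stopping the run is usually safer than continuing to request pages with a dead session.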

Issue: "Hmm, something went wrong" screen

LinkedIn detected automation. Reduce speed, use stealth mode, and add more human-like interactions.

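Issue: A selector stops matching

LinkedIn changes its class names frequently. Matching on part of a class name is less brittle than matching the full name:

# XPath is often more robust for LinkedIn
name = page.query_selector("xpath=//h1[contains(@class, 'top-card')]")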

Full Pipeline: Scrape, Save, and Analyse

import json
import pandas as pd
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def run_pipeline(profile_urls):
    results = []

    with sync_playwright() as p:
        browser, context = create_linkedin_context(p)
        page = create_stealthy_page(context)

        for i, url in enumerate(profile_urls):
            print(f"[{i+1}/{len(profile_urls)}] {url}")
            try:
                data = scrape_profile(page, url)
                results.append(data)
            except Exception as e:
                print(f"  Error: {e}")
                results.append({"profile_url": url, "error": str(e)})

            human_delay(4, 8)

        browser.close()

    # Save raw JSON (explicit UTF-8 so non-ASCII names are written correctly on any platform)
    with open("profiles_raw.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Flatten to DataFrame (ignoring nested lists for CSV)
    flat = []
    for r in results:
        # "or []" covers both a missing key and an empty list, so [0] is never hit blindly
        experience = r.get("experience") or []
        skills = r.get("skills") or []
        first_job = experience[0] if experience else {}
        flat.append({
            "name": r.get("name"),
            "headline": r.get("headline"),
            "location": r.get("location"),
            "about": r.get("about"),
            "experience_count": len(experience),
            "first_role": first_job.get("title"),
            "first_company": first_job.get("company"),
            "skills_count": len(skills),
            "top_skill": skills[0] if skills else None,
            "url": r.get("profile_url")
        })

    df = pd.DataFrame(flat)
    df.to_csv("profiles_summary.csv", index=False)
    print(f"\nSaved {len(df)} profiles.")
    print(df[["name", "headline", "first_company"]].head())
    return df

# Run it
urls = [
    "https://www.linkedin.com/in/example-profile-1/",
    "https://www.linkedin.com/in/example-profile-2/",
]
df = run_pipeline(urls)
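Because a run can be interrupted by a ban screen or an expired session, it helps to resume instead of starting over. The sketch below reads the `profiles_raw.json` file written by the pipeline and keeps only URLs that have no successful record yet. It assumes successful records carry a `profile_url` key, as the flattening step implies; `load_previous_results` and `pending_urls` are hypothetical helpers.

```python
import json
from pathlib import Path

def load_previous_results(path="profiles_raw.json"):
    """Return the records saved by an earlier run, or [] if the file is absent."""
    p = Path(path)
    if not p.exists():
        return []
    return json.loads(p.read_text(encoding="utf-8"))

def pending_urls(profile_urls, previous_results):
    """Keep URLs with no successful record; failed ones are retried."""
    done = {
        r.get("profile_url")
        for r in previous_results
        if "error" not in r          # records with an "error" key did not succeed
    }
    return [u for u in profile_urls if u not in done]
```

A resumed run could then call `run_pipeline(pending_urls(urls, load_previous_results()))`, keeping in mind that the pipeline as written overwrites the JSON file, so earlier results would need to be merged back in before saving.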

Summary

Topic	Key takeaway
Authentication	Never automate login — save and reuse cookies
Stealth	Use `playwright-stealth` to patch 30+ bot signals
Delays	Random delays of 3–8 seconds between profiles
Profiles	Scroll to load, extract name/headline/experience/education
Jobs	Available without login, 25 per page
Company pages	Rich overview stats available in `org-about-module`
Account safety	Use a dedicated scraping account, stay under 100 profiles/day
Best alternative	LinkedIn API or licensed data providers for production
