How to Scrape News Articles: A Practical Guide

By the end of this guide, you'll have a working news scraper that pulls headlines, article URLs, bylines, publish dates, and full article text from real news sites. We'll start with one API call, then build a full pipeline that walks from a section page to every article on it.

Scraping news is different from scraping a static blog. News homepages rebuild themselves every few minutes, ship layout changes without warning, hide content behind paywalls, and run some of the most aggressive anti-bot systems on the web. A requests.get() call that worked last week returns a CAPTCHA page this week.

This tutorial takes a different route. Instead of fighting each site with a custom proxy script, we'll use a cloud browser API that loads the page like a real browser and returns structured JSON. You send a URL and an extraction schema. You get back clean data. Every example uses a real, public news source and returns real data.

What you'll build in this guide:

A one-call news scraper that returns the current top stories as JSON
A section scraper that pulls every article link off a publisher's news page
An article scraper that extracts the headline, byline, date, and body text
A headline scraper for major publishers like the BBC that survives class-name churn
A reusable section-to-article pipeline you can point at most news sites
CSV and JSON export plus a change-detection pattern for monitoring

TL;DR: To scrape news articles, load each page in a real browser and extract structured data with a schema instead of parsing raw HTML. Use CSS selectors for stable sites, AI selectors for messy layouts, and residential proxies for publishers with strong anti-bot defenses. Browserbeam handles the browser, the proxy, and the extraction in a single API call, so ten lines of code replace a Selenium plus proxy-rotation stack.

Quick Start: Scrape News Headlines in One API Call

Here's a complete news scraper. It pulls the current top stories from Hacker News, a public news aggregator with clean, stable HTML. Replace YOUR_API_KEY and run it.

Don't have an API key yet? Create a free Browserbeam account. You get 5,000 credits and no credit card is required.

curl -s -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "steps": [
      {"extract": {"stories": [{
        "_parent": ".athing",
        "_limit": 10,
        "rank": ".rank >> text",
        "title": ".titleline > a >> text",
        "url": ".titleline > a >> href"
      }]}},
      {"close": {}}
    ]
  }' | jq '.extraction.stories'

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(
    url="https://news.ycombinator.com",
    steps=[
        {"extract": {"stories": [{
            "_parent": ".athing",
            "_limit": 10,
            "rank": ".rank >> text",
            "title": ".titleline > a >> text",
            "url": ".titleline > a >> href",
        }]}},
    ],
)

for story in session.extraction["stories"]:
    print(story["rank"], story["title"])
session.close()

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });
const session = await client.sessions.create({
  url: "https://news.ycombinator.com",
  steps: [
    { extract: { stories: [{
      _parent: ".athing",
      _limit: 10,
      rank: ".rank >> text",
      title: ".titleline > a >> text",
      url: ".titleline > a >> href",
    }] } },
  ],
});

const stories = session.extraction?.stories as Array<Record<string, string>>;
for (const story of stories) {
  console.log(story.rank, story.title);
}
await session.close();

require "browserbeam"

client = Browserbeam::Client.new(api_key: "YOUR_API_KEY")
session = client.sessions.create(
  url: "https://news.ycombinator.com",
  steps: [
    { extract: { stories: [{
      _parent: ".athing",
      _limit: 10,
      rank: ".rank >> text",
      title: ".titleline > a >> text",
      url: ".titleline > a >> href"
    }] } }
  ]
)

session.extraction["stories"].each do |story|
  puts "#{story['rank']} #{story['title']}"
end
session.close

The response is structured JSON, ready to use:

[
  {
    "rank": "1.",
    "title": "OpenAI unveils its first custom chip, built by Broadcom",
    "url": "https://techcrunch.com/2026/06/24/openai-unveils-its-first-custom-chip-built-by-broadcom/"
  },
  {
    "rank": "2.",
    "title": "Show HN: I built a local-first note app",
    "url": "https://example-startup.com/notes"
  }
]

That is the whole loop: one call, real data, no browser to manage. The rest of this guide builds on that pattern for harder targets.

What Data Can You Extract from News Sites?

News pages follow predictable structures once you know where to look. Each page type carries a different set of fields.

Page Type	URL Pattern	Available Fields
Aggregator / front page	`news.ycombinator.com`, a publisher homepage	Rank, headline, article URL, source, points or comment count
Section page	`/sections/news/`, `/world`, `/business`	Headline, article URL, teaser, publish date, section
Article page	`/2026/06/24/.../slug`	Headline, byline, publish date, body paragraphs, tags, related links
Topic / tag page	`/topic/ai`, `/tag/elections`	Headline, article URL, date, summary
Search results	`/search?q=...`	Headline, article URL, date, snippet

The richest data lives on article pages: full body text, author, timestamps, and structured metadata. The fastest data to collect lives on section pages, where you can pull dozens of article links in a single request. A good pipeline uses both. Pull links from sections, then visit each article for the details.

Structured Metadata Hiding in the Page

Many news sites embed a <script type="application/ld+json"> block following Schema.org's NewsArticle type. It carries the headline, author, publish date, and sometimes the full body in one clean JSON object. When a site includes it, parsing that block is more stable than chasing CSS classes. The examples below use CSS selectors because they work whether or not a site ships this block, but keep ld+json in mind as a fallback for sites with messy markup.
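As a rough illustration of that fallback, the sketch below pulls a NewsArticle object out of a page using only the Python standard library. It assumes you already have the page's HTML as a string; how you obtain it is outside this sketch. The names `LdJsonParser` and `find_news_article` and the sample HTML are made up for the example.

```python
import json
from html.parser import HTMLParser


class LdJsonParser(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_ld_json:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self.blocks.append("".join(self._buffer))
            self._in_ld_json = False


def find_news_article(html):
    """Return the first NewsArticle object in the page's ld+json, or None."""
    parser = LdJsonParser()
    parser.feed(html)
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the scrape
        # A block may hold one object, a list of objects, or an @graph wrapper.
        if isinstance(data, list):
            candidates = data
        elif isinstance(data, dict):
            candidates = data.get("@graph", [data])
        else:
            continue
        for item in candidates:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "NewsArticle" in types:
                return item
    return None


# Hypothetical sample page, not taken from a real publisher.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Example headline",
 "datePublished": "2026-06-24T14:39:00-04:00",
 "author": [{"@type": "Person", "name": "Example Author"}]}
</script>
</head><body></body></html>
"""
article = find_news_article(sample_html)
print(article["headline"], article["datePublished"])
```

Publishers vary in which fields they fill in, so treat every key as optional when you read the result.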

Why News Sites Are Harder to Scrape Than Blogs

If you have scraped a documentation site or a small blog, news sites will surprise you. Here is what trips up most scrapers, and the fix for each.

Challenge	What Causes It	The Fix
Aggressive anti-bot	Publishers fingerprint headless browsers and datacenter IPs	Use a real browser with residential proxies
Class-name churn	Build tools generate hashed classes like `css-9mylee` that change on deploy	Target `data-testid` attributes or use AI selectors
Paywalls	Body text is gated behind a subscription wall	Scrape only metadata and the free preview; respect the publisher's terms
JavaScript rendering	Headlines load client-side after the initial HTML	Load the page in a browser, not an HTTP client
Rapid content rotation	The front page changes every few minutes	Scrape on a schedule and store snapshots for diffing

The pattern across all of these: a raw HTTP request sees a stripped-down or blocked version of the page. A real browser sees what a reader sees. That single difference solves most news scraping problems.

A note on ethics and the law: Scrape public data, respect robots.txt, and read each publisher's terms of service. Do not bypass paywalls to redistribute paid content. Collecting headlines and metadata for research, monitoring, or personal use is common practice. Republishing full article text without permission is not. When in doubt, link back to the source.
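One way to make the robots.txt check concrete is Python's built-in `urllib.robotparser`. The sketch below parses robots.txt text you have already downloaded and asks whether a URL may be fetched. The helper name `is_allowed`, the sample rules, and the `news.example.com` URLs are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules; a real file comes from https://<site>/robots.txt.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

story_ok = is_allowed(sample_robots, "https://news.example.com/world/story")
draft_ok = is_allowed(sample_robots, "https://news.example.com/private/draft")
print("story allowed:", story_ok)
print("draft allowed:", draft_ok)
```

A passing check only tells you what the file permits. The publisher's terms of service still apply.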

How News Scraping Works: The Section-to-Article Pipeline

Most news scraping jobs follow the same shape. You start at a page that lists many articles, collect the links, then visit each one. Here is the flow.

News Scraping Pipeline

Section / Index Page → Extract Article URLs (schema) → Visit Each Article URL → Extract Headline + Byline + Date + Body → Store / Export (JSON, CSV, DB)

We'll build each stage in the next four sections. First the index, then the section page, then the article, then a harder publisher. By the end you can chain them into a single pipeline.

Three Ways to Extract Data

Browserbeam gives you three extraction styles. Pick the one that fits the page.

Method	How It Works	Best For	Stability
`extract` with CSS	Targets selectors and returns typed JSON	Sites with stable classes or `data-testid`	High when selectors are stable
`extract` with AI selectors	Describes the field in plain language (`ai >> headline`)	Messy layouts, hashed class names	High, slower and uses AI credits
`observe`	Returns the whole page as clean markdown	Quick exploration, paywalled or unknown layouts	Very high, less structured

You'll see all three in this guide. CSS for the clean sites, AI selectors for the messy ones, and observe for sites that block structured extraction.

Step 1: Scrape a News Aggregator (Hacker News)

We start with Hacker News because its HTML is clean and stable. Each story row is an element with the class athing. The rank lives in .rank, and the headline and link live inside .titleline > a.

The Quick Start above already scraped it. Let's look at why the schema works and how to extend it.

Reading the Extraction Schema

The schema is a nested dictionary. The _parent selector picks the repeating element. Every other key maps a field name to a selector plus an extractor. The >> text suffix pulls text content. The >> href suffix pulls the link and resolves it to an absolute URL automatically.

{
  "stories": [{
    "_parent": ".athing",
    "_limit": 10,
    "rank": ".rank >> text",
    "title": ".titleline > a >> text",
    "url": ".titleline > a >> href"
  }]
}

The _limit caps the number of rows. Drop it to pull every story on the page. Add "_require": ["title"] to skip any row where the title comes back empty, which keeps spacer rows out of your results.

Adding Points and Comments

Hacker News keeps the score and comment count in a separate .subtext row. You can pull those in the same call by widening the parent and adding fields. This is the pattern you reuse for any aggregator: find the repeating row, then name the fields you want.

A one-line takeaway: if a site has a clean repeating structure, CSS extraction is the fastest and cheapest path.

Step 2: Scrape a Publisher Section Page (NPR)

Aggregators are easy. Real publishers are where most news scraping happens. NPR is a good first publisher: its markup is reasonable, and the content is public. We'll scrape the main news section and pull every article link.

Two things change from the Hacker News example. We add a proxy so the request looks more like a normal reader (a datacenter proxy is enough here; stricter publishers need a residential one), and we block images, fonts, and media so the page loads faster.

curl -s -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.npr.org/sections/news/",
    "proxy": {"kind": "datacenter", "country": "us"},
    "block_resources": ["image", "font", "media"],
    "steps": [
      {"extract": {"articles": [{
        "_parent": "article.item",
        "_limit": 10,
        "title": ".title >> text",
        "url": "h2.title a >> href",
        "teaser": ".teaser a >> text"
      }]}},
      {"close": {}}
    ]
  }' | jq '.extraction.articles'

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(
    url="https://www.npr.org/sections/news/",
    proxy={"kind": "datacenter", "country": "us"},
    block_resources=["image", "font", "media"],
    steps=[
        {"extract": {"articles": [{
            "_parent": "article.item",
            "_limit": 10,
            "title": ".title >> text",
            "url": "h2.title a >> href",
            "teaser": ".teaser a >> text",
        }]}},
    ],
)

for article in session.extraction["articles"]:
    print(article["title"])
    print("  ", article["url"])
session.close()

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });
const session = await client.sessions.create({
  url: "https://www.npr.org/sections/news/",
  proxy: { kind: "datacenter", country: "us" },
  block_resources: ["image", "font", "media"],
  steps: [
    { extract: { articles: [{
      _parent: "article.item",
      _limit: 10,
      title: ".title >> text",
      url: "h2.title a >> href",
      teaser: ".teaser a >> text",
    }] } },
  ],
});

const articles = session.extraction?.articles as Array<Record<string, string>>;
for (const article of articles) {
  console.log(article.title, article.url);
}
await session.close();

require "browserbeam"

client = Browserbeam::Client.new(api_key: "YOUR_API_KEY")
session = client.sessions.create(
  url: "https://www.npr.org/sections/news/",
  proxy: { kind: "datacenter", country: "us" },
  block_resources: ["image", "font", "media"],
  steps: [
    { extract: { articles: [{
      _parent: "article.item",
      _limit: 10,
      title: ".title >> text",
      url: "h2.title a >> href",
      teaser: ".teaser a >> text"
    }] } }
  ]
)

session.extraction["articles"].each do |article|
  puts article["title"]
  puts "  #{article['url']}"
end
session.close

You get back a list of articles with absolute URLs:

[
  {
    "title": "Meta plans to release AI-powered prediction market app, documents show",
    "url": "https://www.npr.org/2026/06/24/nx-s1-5869486/meta-prediction-market-app-ai",
    "teaser": "June 24, 2026 • The company is building an app separate from Facebook and Instagram..."
  }
]

Those URLs feed Step 3. This is the link-collection stage of the pipeline.

When CSS Classes Get in Your Way: AI Selectors

Some publishers ship markup with no useful classes, or classes that change on every deploy. Instead of reverse-engineering the structure, describe the field in plain language. Swap the CSS selectors for ai >> selectors and the extractor figures out the mapping.

{
  "articles": [{
    "_parent": "article",
    "_limit": 10,
    "title": "ai >> headline",
    "url": "ai >> article link",
    "date": "ai >> publish date"
  }]
}

AI selectors cost a few AI credits per call and run a little slower, but they survive layout changes that break CSS selectors. Use them when a site's markup is hostile or when you want one schema to work across several publishers.

Step 3: Extract Article Metadata and Body Text

Now we visit a single article and pull the details. Take a URL from the Step 2 results and scrape the headline, byline, publish date, and body paragraphs. NPR uses clear, stable selectors: h1 for the headline, time for the date, .byline__name for the author, and #storytext p for the body.

curl -s -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.npr.org/2026/06/24/nx-s1-5869486/meta-prediction-market-app-ai",
    "proxy": {"kind": "datacenter", "country": "us"},
    "block_resources": ["image", "font", "media"],
    "steps": [
      {"extract": {
        "headline": "h1 >> text",
        "date": "time >> text",
        "byline": ".byline__name >> text",
        "body": [{"_parent": "#storytext p", "_limit": 6, "text": "_parent >> text"}]
      }},
      {"close": {}}
    ]
  }' | jq '.extraction'

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")

article_url = "https://www.npr.org/2026/06/24/nx-s1-5869486/meta-prediction-market-app-ai"
session = client.sessions.create(
    url=article_url,
    proxy={"kind": "datacenter", "country": "us"},
    block_resources=["image", "font", "media"],
    steps=[
        {"extract": {
            "headline": "h1 >> text",
            "date": "time >> text",
            "byline": ".byline__name >> text",
            "body": [{"_parent": "#storytext p", "_limit": 6, "text": "_parent >> text"}],
        }},
    ],
)

data = session.extraction
print(data["headline"], "by", data["byline"])
print(data["date"])
for para in data["body"]:
    print(para["text"])
session.close()

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });
const articleUrl =
  "https://www.npr.org/2026/06/24/nx-s1-5869486/meta-prediction-market-app-ai";

const session = await client.sessions.create({
  url: articleUrl,
  proxy: { kind: "datacenter", country: "us" },
  block_resources: ["image", "font", "media"],
  steps: [
    { extract: {
      headline: "h1 >> text",
      date: "time >> text",
      byline: ".byline__name >> text",
      body: [{ _parent: "#storytext p", _limit: 6, text: "_parent >> text" }],
    } },
  ],
});

const data = session.extraction as Record<string, unknown>;
console.log(data.headline, "by", data.byline);
await session.close();

require "browserbeam"

client = Browserbeam::Client.new(api_key: "YOUR_API_KEY")
article_url = "https://www.npr.org/2026/06/24/nx-s1-5869486/meta-prediction-market-app-ai"

session = client.sessions.create(
  url: article_url,
  proxy: { kind: "datacenter", country: "us" },
  block_resources: ["image", "font", "media"],
  steps: [
    { extract: {
      headline: "h1 >> text",
      date: "time >> text",
      byline: ".byline__name >> text",
      body: [{ _parent: "#storytext p", _limit: 6, text: "_parent >> text" }]
    } }
  ]
)

data = session.extraction
puts "#{data['headline']} by #{data['byline']}"
data["body"].each { |para| puts para["text"] }
session.close

The result combines single fields and a list of paragraphs:

{
  "headline": "Meta plans to release AI-powered prediction market app, documents show",
  "date": "June 24, 2026 2:39 PM ET",
  "byline": "Bobby Allyn",
  "body": [
    { "text": "Meta is planning to launch its own prediction market app to compete with companies like Kalshi and Polymarket..." },
    { "text": "Meta CEO Mark Zuckerberg has instructed a team to start building a standalone app called Arena..." }
  ]
}

One nuance worth knowing: the first paragraph in #storytext is sometimes an image caption rather than the article lede. Drop the first item, or filter out short strings that end with "hide caption", and your body text stays clean.
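A minimal sketch of that cleanup, assuming the body arrives as the list of `{"text": ...}` dictionaries shown above. The helper name `clean_body` and the sample paragraphs are invented for the example, and it filters on the "hide caption" suffix alone rather than also checking length.

```python
def clean_body(paragraphs):
    """Return paragraph strings with empty items and caption-like items removed."""
    cleaned = []
    for para in paragraphs:
        text = (para.get("text") or "").strip()
        if not text:
            continue  # skip empty or whitespace-only paragraphs
        if text.lower().endswith("hide caption"):
            continue  # image caption, not article text
        cleaned.append(text)
    return cleaned


# Hypothetical extraction result for illustration.
sample_body = [
    {"text": "A view of the building. Example Photographer/Example Agency hide caption"},
    {"text": "First real paragraph of the story."},
    {"text": "   "},
    {"text": "Second real paragraph of the story."},
]
body_text = clean_body(sample_body)
print("\n\n".join(body_text))
```

Other publishers mark captions differently, so expect to adjust the suffix check per site.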

Use the Article URL Pattern, Not a Hardcoded Link

Article URLs change daily, so do not hardcode them. Pull a fresh URL from the Step 2 section scrape, then feed it into this article scraper. That is the join between the two stages, and it is what makes the pipeline keep working as the news rotates.

Step 4: Scrape Major Publisher Headlines (BBC and The New York Times)

Larger publishers ship two extra hurdles: hashed class names and stronger anti-bot defenses. Here is how to handle both.

The BBC: Target `data-testid`, Not Hashed Classes

The BBC rebuilt its site as a React app. Its CSS class names change between deploys, so selectors like .sc-1185cb28-0 break within weeks. The fix is to target the data-testid attributes the BBC's own engineers added for testing. Those are stable.

Every story card is wrapped in an anchor with data-testid="internal-link", and the headline lives in an h2 with data-testid="card-headline". We pull the headline from the h2 and the URL from the anchor itself using the _parent token.

curl -s -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.bbc.com/news",
    "proxy": {"kind": "datacenter", "country": "us"},
    "block_resources": ["image", "font", "media"],
    "steps": [
      {"extract": {"headlines": [{
        "_parent": "a[data-testid=\"internal-link\"]",
        "_require": ["title"],
        "title": "h2[data-testid=\"card-headline\"] >> text",
        "url": "_parent >> href"
      }]}},
      {"close": {}}
    ]
  }' | jq '.extraction.headlines'

Two details make this reliable. The _parent >> href selector reads the href off the anchor that is the parent element, not a child. And _require: ["title"] drops every navigation and footer link that has no headline, which leaves only real story cards. Notice there is no _limit here: the require filter does the cleanup, so you get the full set of stories.

The result is clean, with absolute URLs:

[
  {
    "title": "Trump cancels signing of landmark bipartisan bill aimed at lowering housing costs",
    "url": "https://www.bbc.com/news/articles/cn4dwz5vz5lo"
  },
  {
    "title": "Congress passes war powers measure for first time, breaking with Trump over Iran",
    "url": "https://www.bbc.com/news/articles/ce8j6g3v3r4o"
  }
]

The New York Times: Observe When Extraction Gets Blocked

The New York Times runs some of the strictest anti-bot defenses of any publisher, and its homepage uses hashed class names like css-9mylee that rotate constantly. Structured CSS extraction is brittle here, and datacenter IPs get blocked outright. The reliable approach is residential proxies plus the observe step, which returns the rendered page as clean markdown.

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(
    url="https://www.nytimes.com",
    proxy={"kind": "residential", "country": "us"},
    block_resources=["image", "font", "media"],
    steps=[{"observe": {}}],
)

# The page comes back as markdown you can parse or feed to an LLM.
print(session.page.markdown.content[:1000])
session.close()

The markdown comes back with the section structure intact:

# New York Times - Top Stories

## Top Stories

Trump Stokes Chaos in Congress as He Huddles With the G.O.P.

President Trump scrapped plans to sign a major housing bill...

For the toughest sites, observe plus markdown is the most dependable path. You give up rigid field names, but you get the content reliably, and markdown is exactly what an LLM wants if your next step is summarization. Be honest with yourself about reliability here: the NYT will sometimes still rate-limit a request. Add retries with backoff and treat the occasional failure as normal.
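Retries with backoff can be sketched as a small wrapper around whatever function performs the scrape. Everything here is illustrative: `with_retries` is a hypothetical helper, the flaky function simulates a rate limit, and catching every `Exception` is a simplification because this guide does not show which errors the SDK raises.

```python
import random
import time


def with_retries(task, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus jitter so parallel workers do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Simulated scrape that fails twice, then succeeds.
calls = {"count": 0}


def flaky_scrape():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "# markdown content"


waits = []  # record the delays instead of actually sleeping in this demo
result = with_retries(flaky_scrape, sleep=waits.append)
print(result, "after", calls["count"], "calls")
```

In real use you would pass a function that creates the session and returns the markdown, and leave `sleep` at its default.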

Publisher Reliability at a Glance

Based on our testing in 2026, here is how a few common sources behave.

Source	Best Method	Proxy	Notes
Hacker News	CSS extract	None needed	Clean, stable HTML
NPR	CSS or AI extract	Datacenter	Stable markup, public content
BBC News	CSS via `data-testid`	Datacenter	Hashed classes, use test ids
The New York Times	`observe` markdown	Residential	Strong anti-bot, add retries

Advanced Patterns for News Pipelines

Once the basics work, these patterns turn a script into a dependable pipeline.

Chain Section and Article Scraping

The full pipeline is two stages joined by the article URLs. Scrape the section, then loop the URLs through the article scraper. The example below visits the articles one at a time to keep the code short; for throughput, run the article scrapes in parallel sessions, with a short pause between batches to stay under rate limits.

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")

# Stage 1: collect article URLs from the section page.
section = client.sessions.create(
    url="https://www.npr.org/sections/news/",
    proxy={"kind": "datacenter", "country": "us"},
    block_resources=["image", "font", "media"],
    steps=[{"extract": {"articles": [{
        "_parent": "article.item",
        "_limit": 5,
        "url": "h2.title a >> href",
    }]}}],
)
urls = [a["url"] for a in section.extraction["articles"] if a.get("url")]
section.close()

# Stage 2: scrape each article for metadata and body.
articles = []
for url in urls:
    s = client.sessions.create(
        url=url,
        proxy={"kind": "datacenter", "country": "us"},
        block_resources=["image", "font", "media"],
        steps=[{"extract": {
            "headline": "h1 >> text",
            "date": "time >> text",
            "byline": ".byline__name >> text",
        }}],
    )
    articles.append(s.extraction)
    s.close()

print(f"Scraped {len(articles)} articles")
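The parallel variant might look like the sketch below, which runs a scrape function over small batches of URLs in a thread pool and pauses between batches. `scrape_in_batches` and `fake_scrape` are hypothetical names. In practice `scrape_one` would wrap the Stage 2 session call, and this assumes the SDK client can be shared across threads, which the guide does not confirm.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scrape_in_batches(urls, scrape_one, batch_size=3, pause_seconds=1.0, sleep=time.sleep):
    """Run scrape_one over urls, batch_size at a time, pausing between batches."""
    results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            # pool.map returns results in the same order as the input URLs.
            results.extend(pool.map(scrape_one, batch))
        if start + batch_size < len(urls):
            sleep(pause_seconds)  # short pause before the next batch
    return results


def fake_scrape(url):
    """Stand-in for a real article scrape so the sketch runs offline."""
    return {"url": url, "headline": f"Headline for {url}"}


demo_urls = [f"https://news.example.com/story-{i}" for i in range(5)]
pauses = []  # capture pauses instead of sleeping in this demo
demo_results = scrape_in_batches(demo_urls, fake_scrape, batch_size=2, sleep=pauses.append)
print(len(demo_results), "results,", len(pauses), "pauses")
```

Keeping the batch size small is a deliberate choice: it bounds how many sessions are open at once and makes the pause between batches meaningful.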

Export to CSV and JSON

Once you have a list of article dictionaries, exporting is plain Python. No API calls needed here, so this runs locally on the data you already collected.

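import csv
import json

# JSON keeps the full structure of each article.
with open("news.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=2, ensure_ascii=False)

# CSV is flat: one row per article. Fields not listed in fieldnames are ignored.
with open("news.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["headline", "date", "byline"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(articles)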

Detect What Changed Since the Last Run

For monitoring, you do not want the whole front page every time. You want what is new. Store the article URLs you have already seen, then diff each new scrape against that set.

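import os

# Load the URLs recorded by previous runs (one per line).
seen = set()
if os.path.exists("seen.txt"):
    with open("seen.txt") as f:
        seen = set(f.read().split())

# The extracted article dicts carry no URL, so pair each with the URL it came from.
fresh = [(url, a) for url, a in zip(urls, articles) if url not in seen]

for url, a in fresh:
    print("NEW:", a.get("headline"), url)

with open("seen.txt", "a") as f:
    for url, _ in fresh:
        f.write(url + "\n")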

This turns a one-off scrape into a monitor. Run it on a schedule and you have a feed of only the new stories. For a deeper take on scheduled monitoring, see the competitive intelligence agent guide.

Real-World Use Cases

News scraping powers more than dashboards. Here are three concrete builds.

Topic Monitoring and Alerting

Track a topic across several publishers and get alerted when a new story lands. Scrape each section page on a schedule, filter headlines for your keywords, and push matches to Slack or email. The change-detection pattern above is the core of it.
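The keyword filter at the heart of that loop can be sketched in a few lines. This assumes articles shaped like the Step 2 output, with `title` and `url` keys. The helper name `match_keywords` and the sample headlines are made up, and delivery to Slack or email is left out.

```python
import re


def match_keywords(articles, keywords):
    """Return articles whose title contains any keyword as a whole word."""
    if not keywords:
        return []
    # Word boundaries stop "ai" from matching inside "Rain".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [a for a in articles if pattern.search(a.get("title") or "")]


# Hypothetical section-scrape results.
sample_articles = [
    {"title": "Central bank holds rates steady", "url": "https://news.example.com/a"},
    {"title": "New AI chip announced", "url": "https://news.example.com/b"},
    {"title": "Rain expected this weekend", "url": "https://news.example.com/c"},
]
hits = match_keywords(sample_articles, ["ai", "rates"])
for hit in hits:
    print("MATCH:", hit["title"], hit["url"])
```

Whole-word matching works for plain words; keywords that begin or end with punctuation would need a different boundary rule.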

News Datasets for NLP and LLMs

Building a sentiment model or fine-tuning a summarizer needs clean text. Scrape article bodies, strip captions and boilerplate, and store the headline, date, source, and body as JSONL. The structured output from the article scraper is close to training-ready already. The LLM training data pipeline guide covers the cleaning and formatting steps in depth.

A Personal News Aggregator

Pull headlines from your favorite sources into one feed, ranked however you like. This is the news-crawler use case: collect links from many sections, dedupe by URL, and store them in a small database. The web crawler guide shows how to manage a URL frontier and visited set when you scale beyond a handful of sources.
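For the "dedupe by URL, store in a small database" step, SQLite is one reasonable choice. The sketch below uses the URL as the primary key so repeated inserts are silently skipped. The table layout, the `save_links` helper, and the sample links are assumptions for illustration.

```python
import sqlite3


def save_links(conn, links):
    """Insert links, skipping URLs already stored. Returns how many rows were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, source TEXT)"
    )
    before = conn.total_changes
    # INSERT OR IGNORE drops any row whose url already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO articles (url, title, source) "
        "VALUES (:url, :title, :source)",
        links,
    )
    conn.commit()
    return conn.total_changes - before


# In-memory database for the demo; pass a file path to persist between runs.
conn = sqlite3.connect(":memory:")
sample_links = [
    {"url": "https://news.example.com/a", "title": "Story A", "source": "example"},
    {"url": "https://news.example.com/b", "title": "Story B", "source": "example"},
    {"url": "https://news.example.com/a", "title": "Story A again", "source": "example"},
]
added_first = save_links(conn, sample_links)
added_again = save_links(conn, sample_links)
stored = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(added_first, added_again, stored)
```

The return value doubles as a cheap "how many new stories" signal for each run.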

Common Mistakes When Scraping News

These are the errors that cost the most time. Each has a quick fix.

Mistake 1: Using an HTTP Client Instead of a Browser

requests or axios fetches the initial HTML, which on most news sites is a shell. Headlines render with JavaScript afterward, so you get an empty result or a bot-challenge page. Load the page in a real browser so the content is there before you extract.

Mistake 2: Hardcoding Hashed CSS Classes

Selectors like .css-9mylee or .sc-1185cb28-0 are generated at build time and change on the next deploy. Your scraper breaks silently. Target data-testid attributes, semantic tags like h1 and time, or use AI selectors that describe the field instead.

Mistake 3: Skipping Proxies on Big Publishers

Requests with no proxy, or from datacenter IPs, are easy to fingerprint, and major publishers block them fast. If you see 403s or CAPTCHA pages, switch to residential proxies. For the difference between proxy types and when to use each, see the residential vs datacenter proxies guide.

Mistake 4: Treating the First Paragraph as the Lede

On many publishers the first element in the body container is an image caption or a photo credit, not the opening sentence. Drop the first paragraph or filter out short strings with credit text, and your dataset stays clean.

Mistake 5: Forgetting to Close Sessions

Every open session consumes runtime and keeps the billing clock running. Call session.close() when you finish, ideally in a finally block so it runs even when extraction fails. In a loop, close each session before starting the next.
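The finally pattern can be wrapped in a small helper so every scrape gets the same cleanup. This is a sketch: `scrape_and_close` and `FakeSession` are hypothetical, and the only thing assumed about a real session is the `close()` method used throughout this guide.

```python
def scrape_and_close(create_session, read_result):
    """Open a session, read from it, and close it even if reading raises."""
    session = create_session()
    try:
        return read_result(session)
    finally:
        session.close()  # runs on success and on failure


class FakeSession:
    """Stand-in for an SDK session so the sketch runs offline."""

    def __init__(self):
        self.closed = False
        self.extraction = {"headline": "Example headline"}

    def close(self):
        self.closed = True


def failing_read(session):
    raise ValueError("simulated extraction failure")


ok_session = FakeSession()
headline = scrape_and_close(lambda: ok_session, lambda s: s.extraction["headline"])

bad_session = FakeSession()
try:
    scrape_and_close(lambda: bad_session, failing_read)
except ValueError:
    pass  # the error still propagates; the session is closed first

print(headline, ok_session.closed, bad_session.closed)
```

With the real SDK, `create_session` would be a function that calls `client.sessions.create(...)` with your URL and steps.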

Browserbeam vs DIY Playwright vs Scraping API Services

You have three real options for scraping news at scale. Here is the tradeoff.

Approach	What You Manage	Anti-Bot Handling	Output	Best For
DIY Playwright / Puppeteer	Browsers, proxies, retries, parsing	You build it all	Raw HTML to parse	Full control, you have the ops time
Generic HTML scraping API	Just the request	Built in	Raw HTML or a fixed parser	Simple page fetches
Browserbeam	Just the schema	Built in (residential proxies, stability detection)	Structured JSON or markdown	Structured news data with low ops

Pros of Browserbeam for news:

One call returns structured JSON, no HTML parsing layer to maintain
Residential proxies and stability detection are built in
AI selectors handle hashed class names that break CSS scrapers
observe markdown is a reliable fallback for the toughest sites

Cons of Browserbeam:

It is a paid API once you pass the free credits
For a single static page, a plain HTTP request is cheaper

If you already run browser infrastructure and enjoy maintaining it, Playwright is a fine choice. For most teams that want news data without an ops burden, a browser API earns its keep. For the broader build-versus-buy picture, the structured web scraping guide and the Python web scraping guide go deeper.

Frequently Asked Questions

How do I scrape news articles without getting blocked?

Load each page in a real browser rather than an HTTP client, and route requests through residential proxies for publishers with strong anti-bot defenses. A real browser passes the fingerprint checks that block headless scrapers, and residential IPs avoid the datacenter blocklists most news sites use.

Is it legal to scrape news articles?

Scraping public data is generally permitted, but the rules depend on the site's terms of service and your jurisdiction. Collect headlines and metadata for research or monitoring, respect robots.txt, and do not republish full paid articles or bypass paywalls. When in doubt, link back to the original source instead of copying the body.

What is the best way to scrape news headlines in Python?

Create a browser session, pass a CSS extraction schema with a _parent selector for the repeating story element, and read the structured result from session.extraction. The Python examples above show the full pattern for Hacker News and NPR; the BBC example is shown with curl, and the same schema carries over to the Python client.

How do I scrape a news site that loads content with JavaScript?

Use a cloud browser that renders the page before extraction. An HTTP request only sees the initial HTML shell, but a browser runs the JavaScript that injects the headlines, so the content is present when you extract it.

Can I scrape paywalled news articles?

You can usually scrape the free metadata and preview text that the publisher exposes, such as the headline, byline, date, and the opening lines. Do not circumvent the paywall to extract gated full text, since that breaks most terms of service and may infringe copyright.

How do I build a news crawler that follows links?

Scrape a section page to collect article URLs, store the URLs you have visited in a set to avoid duplicates, then visit each new URL to extract the article. This section-to-article loop is the core of a news crawler. The web crawler guide covers the URL frontier and politeness rules for larger crawls.

How often does a news scraper break?

CSS-based scrapers break whenever a publisher changes its markup, which for big sites can be every few weeks. Targeting data-testid attributes and semantic tags reduces breakage, and AI selectors that describe fields in plain language survive most layout changes.

What output format is best for a news dataset?

JSONL works well for one article per line, which is convenient for streaming into NLP pipelines and LLM training. Store the headline, source, publish date, URL, and cleaned body text per record, and keep the raw markdown if you might re-process it later.
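Writing and reading JSONL needs nothing beyond the standard library. The sketch below stores one record per line and streams them back. The helper names, file name, and sample records are placeholders.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines inside strings, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield records one at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical records for illustration.
sample_records = [
    {"headline": "Example headline", "source": "example", "date": "2026-06-24",
     "url": "https://news.example.com/a", "body": "First paragraph.\nSecond paragraph."},
    {"headline": "Café reopens downtown", "source": "example", "date": "2026-06-25",
     "url": "https://news.example.com/b", "body": "Single paragraph."},
]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "news.jsonl")
    write_jsonl(sample_records, jsonl_path)
    with open(jsonl_path, encoding="utf-8") as f:
        line_count = sum(1 for _ in f)
    loaded = list(read_jsonl(jsonl_path))

print(line_count, "lines,", len(loaded), "records")
```

Because each line is independent, you can append new articles to the same file on every run without rewriting it.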

Start Scraping News Today

We built the full stack: a one-call headline scraper, a section scraper that collects article URLs, an article scraper for headline, byline, date, and body, and a headline scraper for big publishers that survives class-name churn. Then we chained them into a section-to-article pipeline with CSV and JSON export plus change detection.

The key insight for news: a real browser plus the right proxy solves most of the hard problems before you write a single selector. Once the page loads like a reader sees it, extraction is just naming the fields you want.

Try pointing the section scraper at a different publisher. Swap in your own keywords for the topic monitor. Change the _limit to pull more stories, or drop it and add _require to filter the noise. The same schema pattern works across most public news sites.

For the full API reference, see the Browserbeam documentation. If your next step is summarizing what you scrape, the structured web scraping guide goes deeper on schemas and the observe-versus-extract tradeoff. What source will you scrape first?