Web Scraping with Node.js: Libraries, Tools and Code Examples

May 08, 2026 · 20 min read

Node.js web scraping tools: Cheerio, Puppeteer, Playwright, Axios, and Browserbeam with the Node.js logo

By the end of this guide, you'll have working Node.js scrapers that fetch pages, parse HTML, extract structured data, handle JavaScript-rendered content, and paginate through entire sites. We'll build each one from scratch, using real websites and real selectors you can copy and run.

Node.js has a deep scraping toolchain. Axios and Cheerio handle static HTML. Puppeteer and Playwright spin up real browsers for JavaScript-heavy pages. Cloud browser APIs like Browserbeam let you skip the browser management entirely and get structured data back from a single API call. The challenge isn't finding libraries. It's knowing which one fits your use case and how to avoid the async pitfalls that trip up most Node.js scrapers.

This guide covers the full range: from a five-line Cheerio script to production-grade scrapers with proxies, retries, and concurrent page fetching. Every code example runs against real, publicly accessible websites. Copy, paste, run.

What you'll learn in this guide:

  • How to fetch and parse HTML with Axios and Cheerio
  • How to scrape JavaScript-rendered pages with Puppeteer, Playwright, and Browserbeam
  • How to extract structured data using CSS selectors and extraction schemas
  • How to handle pagination, infinite scroll, and multi-page scraping
  • How to use proxies, manage cookies, and implement retries
  • Three real-world scraping projects you can build today
  • How to choose the right Node.js scraping library for your use case

TL;DR: Node.js web scraping starts with Axios + Cheerio for static HTML, but most modern sites need a real browser. Puppeteer and Playwright handle JavaScript rendering locally. Cloud browser APIs like Browserbeam handle it remotely with structured output and no browser management. This guide walks through both approaches with working code, then covers pagination, proxies, concurrency, and three real-world projects.


What Is Web Scraping and Why Node.js?

Web scraping is how you turn web pages into structured data. A script fetches a page, finds the elements you care about, and pulls out the text, links, or attributes. Prices, job listings, headlines, product reviews. If it's on a web page, you can scrape it.

The basic loop is always the same: fetch the page, parse the HTML, extract the target data. Everything else (pagination, JavaScript rendering, proxy rotation) builds on top of that.

How Web Scraping Works

Every web page is an HTML document. Your scraper sends an HTTP request, receives that HTML, and uses a parser to locate specific elements. CSS selectors tell the parser which elements contain the data you want. The scraper reads the text or attributes from those elements and saves the result.

The wrinkle: modern websites load data with JavaScript after the initial HTML arrives. For those sites, a simple HTTP request returns an empty shell. You need a tool that executes JavaScript, which means a headless browser or a browser API.

For a deeper walkthrough of scraping fundamentals, see our Python web scraping guide. The concepts are identical across languages.

Why Node.js for Web Scraping

Node.js runs JavaScript natively, which gives it a unique advantage when scraping JavaScript-heavy sites. You're writing in the same language the page uses. Debugging selectors, understanding DOM APIs, and working with JSON responses all feel natural.

The async I/O model is a strong fit for scraping. While one request waits for a response, your scraper can fire off others. Promise.all and async/await make concurrent scraping straightforward without threads or multiprocessing.
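
Here's a minimal sketch of that pattern (the concurrency section later in this guide expands on it): three catalogue pages fetched in parallel with a single await.

import axios from "axios";

const urls = [
  "https://books.toscrape.com/catalogue/page-1.html",
  "https://books.toscrape.com/catalogue/page-2.html",
  "https://books.toscrape.com/catalogue/page-3.html",
];

// All three requests are in flight at once; await resolves when every one finishes
const responses = await Promise.all(urls.map(url => axios.get(url)));
console.log(responses.map(r => r.data.length)); // HTML length of each page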

The npm registry has over 2 million packages. The scraping-specific ones (Cheerio, Puppeteer, Playwright, got-scraping) are mature, well-maintained, and widely used. TypeScript support is first-class across all of them, which catches selector typos and data shape mismatches at compile time.

If your stack is already JavaScript or TypeScript, there's no reason to context-switch to another language for scraping. The tools are here, and they're good.

Node.js Web Scraping Libraries Compared

Before we write any code, let's look at the options. Each library fills a different niche, and picking the wrong one wastes time.

Axios + Cheerio

Axios handles HTTP requests. Cheerio parses the HTML response using a jQuery-like API. Together, they're the fastest way to scrape static HTML pages.

Strengths:
- Lightweight. No browser, no Chromium download. Installs in seconds.
- Fast. Parsing HTML in memory is orders of magnitude faster than launching a browser.
- Familiar API. If you've used jQuery, you already know Cheerio.

Limitations:
- No JavaScript execution. Pages that render content client-side return empty HTML.
- No cookie/session management built in (you handle it manually with Axios interceptors).
- No interaction. You can't click buttons, fill forms, or scroll.

Puppeteer

Puppeteer is Google's Node.js library for controlling Chrome/Chromium. It launches a real browser, navigates to pages, executes JavaScript, and gives you access to the rendered DOM.

Strengths:
- Full JavaScript execution. Every page renders exactly as a user would see it.
- Rich API for interaction: clicking, typing, screenshotting, PDF generation.
- Battle-tested. Used in production by thousands of teams since 2017.

Limitations:
- Downloads Chromium (~300MB) on install.
- Resource-heavy. Each browser instance uses 200-500MB of RAM.
- Chrome-only (Chromium, technically). No Firefox or WebKit support.
- You manage the browser lifecycle: launching, crashing, zombie processes.

Playwright

Playwright is Microsoft's multi-browser automation library. It supports Chromium, Firefox, and WebKit from a single API.

Strengths:
- Multi-browser support. Test and scrape across Chrome, Firefox, and Safari engines.
- Auto-wait. Built-in smart waiting that reduces flaky selectors.
- Browser contexts for isolated sessions without full browser restarts.
- Active development and strong TypeScript types.

Limitations:
- Downloads all three browser engines (~600MB+) by default.
- Resource-heavy, same as Puppeteer.
- More complex API surface than Puppeteer for simple scraping tasks.
- You still manage browser infrastructure in production.

Browserbeam

Browserbeam is a cloud browser API. You send HTTP requests, Browserbeam runs the browser remotely, and returns structured data. No browser to install, no infrastructure to manage.

Strengths:
- No local browser. No Chromium downloads, no Docker containers, no crash handling.
- Structured output. Get markdown, page maps, and extraction results instead of raw HTML.
- Built-in stability detection. The API waits for the page to finish loading automatically.
- Auto-dismiss blockers. Cookie banners and popups handled for you.
- Extraction schemas. Describe the data shape you want, get JSON back.
- Proxy support (datacenter and residential) built into the API.

Limitations:
- Requires internet connectivity. Not for offline development.
- API latency is higher than local browser calls (~2-4 seconds per request vs sub-second locally).
- Newer product with a smaller community than Puppeteer or Playwright.
- No raw CDP access (by design; the API abstracts the browser).

Library Comparison

Feature | Axios + Cheerio | Puppeteer | Playwright | Browserbeam
JavaScript execution | No | Yes | Yes | Yes
Browser download | None | ~300MB | ~600MB | None
RAM per instance | ~20MB | 200-500MB | 200-500MB | 0 (cloud)
Output format | Raw HTML | Raw HTML + DOM | Raw HTML + DOM | Structured markdown + JSON
Auto-wait / stability | No | Manual | Built-in | Built-in
Multi-browser | N/A | Chrome only | Chrome, Firefox, WebKit | Chrome (cloud)
Proxy support | Manual | Manual | Manual | Built-in
Cookie banner handling | Manual | Manual | Manual | Automatic
TypeScript types | Yes | Yes | Yes | Yes
Best for | Static HTML pages | JS-heavy scraping, local | Multi-browser, testing | AI agents, production scraping

Setting Up Your Node.js Scraping Environment

Let's set up a project. We'll install the core libraries, create a clean project structure, and optionally add TypeScript.

Installing Core Libraries

Start with a new project directory and initialize it. We'll use ES modules (import syntax) and top-level await, so set "type": "module" in your package.json:

mkdir node-scraper && cd node-scraper
npm init -y
npm pkg set type=module
npm install axios cheerio

For browser-based scraping, add one of these:

# Puppeteer (downloads Chromium automatically)
npm install puppeteer

# OR Playwright (then download its browser binaries)
npm install playwright
npx playwright install

# OR Browserbeam SDK (no browser download)
npm install @browserbeam/sdk

Project Structure for Scraping Scripts

Keep things simple. One file per scraper, a shared utilities module, and an output directory:

node-scraper/
├── package.json
├── scrapers/
│   ├── books.js          # Static HTML scraper
│   ├── quotes.js         # JS-rendered scraper
│   └── headlines.js      # News scraper
├── utils/
│   ├── retry.js          # Retry logic
│   └── export.js         # JSON/CSV export helpers
└── output/               # Scraped data goes here

TypeScript Setup

TypeScript catches selector typos and data shape bugs at compile time. If you're building anything beyond a quick throwaway script, it's worth the five-minute setup.

npm install -D typescript @types/node
npx tsc --init

Set "module": "nodenext" and "moduleResolution": "nodenext" in your tsconfig.json. All the libraries we're using ship with TypeScript declarations, so you get autocomplete and type checking out of the box.

Your First Node.js Web Scraper: Step by Step

Let's build a scraper that extracts book titles, prices, and availability from Books to Scrape. This is a static HTML site, so Axios + Cheerio is all we need.

Step 1: Fetch a Web Page with Axios

import axios from "axios";

const response = await axios.get("https://books.toscrape.com");
console.log(response.status); // 200
console.log(response.data.length); // ~50000 characters of HTML

Axios returns the full HTML as a string in response.data. For static sites, this is all the fetching you need.

Step 2: Parse HTML with Cheerio

import * as cheerio from "cheerio";

const $ = cheerio.load(response.data);

The $ function works like jQuery. You pass CSS selectors and get back a Cheerio object with methods to extract text, attributes, and HTML.

Step 3: Extract Data with CSS Selectors

const books = [];

$("article.product_pod").each((index, element) => {
  const title = $(element).find("h3 a").attr("title");
  const price = $(element).find(".price_color").text();
  const inStock = $(element).find(".instock.availability").text().trim();
  const url = $(element).find("h3 a").attr("href");

  books.push({ title, price, inStock, url });
});

console.log(books.length); // 20
console.log(books[0]);
// { title: "A Light in the Attic", price: "£51.77", inStock: "In stock", url: "catalogue/a-light-in-the-..." }

Each article.product_pod on the page represents one book. We use .find() to locate child elements and .text() or .attr() to pull out the data.

Step 4: Export to JSON or CSV

import { writeFileSync } from "fs";

// JSON
writeFileSync("output/books.json", JSON.stringify(books, null, 2));

// CSV
const header = "title,price,inStock,url";
const escapeCsv = value => `"${String(value ?? "").replace(/"/g, '""')}"`; // escape embedded quotes
const rows = books.map(b =>
  [b.title, b.price, b.inStock, b.url].map(escapeCsv).join(",")
);
writeFileSync("output/books.csv", [header, ...rows].join("\n"));

Here's the complete scraper in one file:

import axios from "axios";
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const response = await axios.get("https://books.toscrape.com");
const $ = cheerio.load(response.data);

const books = [];
$("article.product_pod").each((index, element) => {
  books.push({
    title: $(element).find("h3 a").attr("title"),
    price: $(element).find(".price_color").text(),
    inStock: $(element).find(".instock.availability").text().trim(),
    url: $(element).find("h3 a").attr("href"),
  });
});

writeFileSync("output/books.json", JSON.stringify(books, null, 2));
console.log(`Scraped ${books.length} books`);

Twenty lines. That's the baseline. Everything from here builds on this pattern.

Scraping JavaScript-Heavy Sites with Node.js

The Cheerio approach works until it doesn't. Open the JavaScript version of Quotes to Scrape (quotes.toscrape.com/js/) in your browser and you'll see quotes. Fetch it with Axios and the quotes aren't there. They're loaded by JavaScript after the initial HTML arrives.

Why Static Scrapers Fail on Modern Sites

When Axios fetches a page, it downloads the raw HTML. No JavaScript runs. If the page uses React, Vue, Angular, or any client-side rendering, the data simply isn't in the HTML response. You'll get a <div id="root"></div> and nothing else.

The fix: use a tool that executes JavaScript. That means a headless browser (Puppeteer, Playwright) or a cloud browser API (Browserbeam).
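
You can see the failure for yourself: point the Cheerio approach from earlier at the JavaScript version of Quotes to Scrape and it finds nothing, because the .quote elements only exist after client-side scripts run.

import axios from "axios";
import * as cheerio from "cheerio";

const response = await axios.get("https://quotes.toscrape.com/js/");
const $ = cheerio.load(response.data);

// The quotes are injected by JavaScript, so the raw HTML contains none of them
console.log($(".quote").length); // 0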

Scraping with Puppeteer

import puppeteer from "puppeteer";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.goto("https://quotes.toscrape.com/js/", {
  waitUntil: "networkidle2",
});

const quotes = await page.evaluate(() => {
  return Array.from(document.querySelectorAll(".quote")).map(el => ({
    text: el.querySelector(".text")?.textContent,
    author: el.querySelector(".author")?.textContent,
  }));
});

console.log(quotes.length); // 10
console.log(quotes[0]);
// { text: ""The world as we have created it...", author: "Albert Einstein" }

await browser.close();

Puppeteer launches a real Chromium instance, waits for the page to finish loading, then runs page.evaluate() to extract data from the live DOM. The waitUntil: "networkidle2" option tells Puppeteer to wait until there are no more than 2 active network connections for 500ms.

Scraping with Playwright

import { chromium } from "playwright";

const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();

await page.goto("https://quotes.toscrape.com/js/");
await page.waitForSelector(".quote");

const quotes = await page.evaluate(() => {
  return Array.from(document.querySelectorAll(".quote")).map(el => ({
    text: el.querySelector(".text")?.textContent,
    author: el.querySelector(".author")?.textContent,
  }));
});

console.log(quotes.length); // 10
await browser.close();

Playwright's waitForSelector is often more reliable than Puppeteer's waitUntil options. It waits until the specific element appears in the DOM, which is exactly what you want when scraping dynamically loaded content.

Scraping with a Cloud Browser API

With Browserbeam, there is no browser to launch or manage. You send an API request and get structured data back.

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });
const session = await client.sessions.create({
  url: "https://quotes.toscrape.com/js/",
});

// The page is already loaded and stable. Extract structured data:
await session.extract({
  quotes: [{ _parent: ".quote", text: ".text >> text", author: ".author >> text" }],
});

console.log(session.extraction.quotes);
// [{ text: "The world as we have created it...", author: "Albert Einstein" }, ...]

await session.close();

No Chromium download. No waitUntil guessing. No page.evaluate(). The API handles JavaScript execution, stability detection, and data extraction. You describe the data shape you want, and get JSON back.

The tradeoff is latency (API call vs local browser) and cost (Browserbeam uses a credit system). For production scrapers that need to run reliably without infrastructure management, the tradeoff usually makes sense.

Approach | Lines of code | Setup time | JS rendering | Output format
Axios + Cheerio | 15 | Seconds | No | Raw HTML
Puppeteer | 20 | Minutes (Chromium download) | Yes | Raw DOM
Playwright | 20 | Minutes (browser download) | Yes | Raw DOM
Browserbeam | 10 | Seconds (API key) | Yes | Structured JSON

Handling Pagination, Multiple Pages, and Infinite Scroll

A single page rarely has all the data you need. Let's handle the three most common multi-page patterns.

Following Next Page Links

Books to Scrape has 50 pages. Each page has a "next" button linking to the next page. Here's how to follow them all:

import axios from "axios";
import * as cheerio from "cheerio";

let url = "https://books.toscrape.com";
const allBooks = [];

while (url) {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  $("article.product_pod").each((i, el) => {
    allBooks.push({
      title: $(el).find("h3 a").attr("title"),
      price: $(el).find(".price_color").text(),
    });
  });

  const nextBtn = $("li.next a");
  url = nextBtn.length ? new URL(nextBtn.attr("href"), url).href : null;

  console.log(`Scraped page, total: ${allBooks.length} books`);
}

console.log(`Done. ${allBooks.length} books total.`); // 1000 books

The key pattern: after scraping each page, look for the "next" link. If it exists, build the full URL and continue. If not, stop.

Offset and Cursor-Based Pagination

Some APIs and sites use offset parameters (?page=2) or cursor tokens (?after=abc123). The loop is the same, just with URL parameter manipulation:

let page = 1;
const allItems = [];

while (true) {
  let response;
  try {
    response = await axios.get(`https://books.toscrape.com/catalogue/page-${page}.html`);
  } catch (error) {
    break; // 404 or network error means no more pages
  }

  const $ = cheerio.load(response.data);

  const items = [];
  $("article.product_pod").each((i, el) => {
    items.push({ title: $(el).find("h3 a").attr("title") });
  });

  if (items.length === 0) break;

  allItems.push(...items);
  page++;
}

console.log(`Scraped ${page - 1} pages, ${allItems.length} items`);

Infinite Scroll Pages

Infinite scroll pages load more content as you scroll down. No next button, no page parameter. You need a real browser that can scroll.

With Browserbeam, the scrollCollect method handles this automatically:

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });
const session = await client.sessions.create({
  url: "https://quotes.toscrape.com/scroll",
});

await session.scrollCollect({ max_scrolls: 10, wait_ms: 1000 });

await session.extract({
  quotes: [{ _parent: ".quote", text: ".text >> text", author: ".author >> text" }],
});

console.log(session.extraction.quotes.length); // Up to 100 quotes
await session.close();

scrollCollect scrolls the page repeatedly, waiting for new content to load after each scroll. It stops when no new content appears or it hits the max_scrolls limit. All the lazy-loaded content is in the DOM by the time you call extract.

With Puppeteer, you'd write the scroll loop manually:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://quotes.toscrape.com/scroll");

let previousHeight = 0;
for (let i = 0; i < 10; i++) {
  await page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
  await new Promise(r => setTimeout(r, 1500));

  const newHeight = await page.evaluate("document.body.scrollHeight");
  if (newHeight === previousHeight) break;
  previousHeight = newHeight;
}

const quotes = await page.evaluate(() =>
  Array.from(document.querySelectorAll(".quote")).map(el => ({
    text: el.querySelector(".text")?.textContent,
    author: el.querySelector(".author")?.textContent,
  }))
);

console.log(quotes.length);
await browser.close();

Both work. The Browserbeam version is shorter and doesn't require tuning scroll timing. The Puppeteer version gives you full control over the scroll behavior.

Advanced Node.js Web Scraping Patterns

Once your basic scraper works, these patterns make it production-ready.

Using Proxies for Web Scraping

Proxies rotate your IP address so target sites don't block you after too many requests from the same source.

With Axios, set the proxy in the request config:

const response = await axios.get("https://books.toscrape.com", {
  proxy: {
    host: "proxy.example.com",
    port: 8080,
    auth: { username: "user", password: "pass" },
  },
});

With Browserbeam, proxy support is built into the session:

const session = await client.sessions.create({
  url: "https://books.toscrape.com",
  proxy: { kind: "residential", country: "US" },
});

No proxy provider setup. No IP rotation logic. The API handles it. For more on proxy types and when to use each one, see our guide to residential vs datacenter proxies.

Managing Cookies and Sessions

Some sites require login or track state with cookies. The typical pattern: fetch the login page to get a CSRF token, submit credentials with that token, then forward the session cookie on subsequent requests.

import axios from "axios";
import * as cheerio from "cheerio";

// Step 1: Fetch the login page to get the CSRF token and session cookie
const loginPage = await axios.get("https://quotes.toscrape.com/login");
const $ = cheerio.load(loginPage.data);
const csrfToken = $('input[name="csrf_token"]').val();
const initialCookies = loginPage.headers["set-cookie"];

// Step 2: Submit credentials with the CSRF token
const loginResponse = await axios.post(
  "https://quotes.toscrape.com/login",
  `csrf_token=${csrfToken}&username=admin&password=admin`,
  {
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Cookie: initialCookies?.join("; "),
    },
    maxRedirects: 0,
    validateStatus: status => status >= 200 && status < 400,
  }
);

// Step 3: Use the authenticated session cookie for subsequent requests
const sessionCookies = loginResponse.headers["set-cookie"] || initialCookies;
const pageResponse = await axios.get("https://quotes.toscrape.com", {
  headers: { Cookie: sessionCookies?.join("; ") },
});

With Puppeteer or Playwright, cookies persist automatically within a browser context, so you just navigate to the login page, fill the form, and submit. With Browserbeam, sessions maintain state across multiple API calls until you close them.
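
Here's what that looks like with Puppeteer, as a rough sketch against the Quotes to Scrape login form (the #username and #password selectors are what that demo site uses; adjust them for your target):

import puppeteer from "puppeteer";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.goto("https://quotes.toscrape.com/login");
await page.type("#username", "admin");
await page.type("#password", "admin");

// Submit and wait for the redirect; the session cookie now lives in the browser context
await Promise.all([
  page.waitForNavigation(),
  page.click('input[type="submit"]'),
]);

// Subsequent navigations reuse the same cookies automatically
await page.goto("https://quotes.toscrape.com");
await browser.close();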

Rate Limiting and Retries

Hammering a site with hundreds of requests per second will get you blocked. Add a delay between requests and retry on failures:

function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function fetchWithRetry(url, retries = 3, delayMs = 1000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const response = await axios.get(url, { timeout: 10000 });
      return response;
    } catch (error) {
      console.log(`Attempt ${attempt} failed: ${error.message}`);
      if (attempt === retries) throw error;
      await delay(delayMs * 2 ** (attempt - 1)); // exponential backoff: 1s, 2s, 4s, ...
    }
  }
}

// Use it with a delay between pages
for (const url of urls) {
  const response = await fetchWithRetry(url);
  // ... process response
  await delay(1000); // 1 second between requests
}

A good rule of thumb: 1-2 seconds between requests for small sites, 3-5 seconds for larger ones. Always check robots.txt for crawl delay directives.

Concurrent Scraping with Promise.all

When you need speed and the target site can handle it, scrape multiple pages in parallel:

import axios from "axios";
import * as cheerio from "cheerio";

// delay() is the same helper defined in the retry example above
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

const urls = Array.from({ length: 50 }, (_, i) =>
  `https://books.toscrape.com/catalogue/page-${i + 1}.html`
);

const BATCH_SIZE = 5;
const allBooks = [];

for (let i = 0; i < urls.length; i += BATCH_SIZE) {
  const batch = urls.slice(i, i + BATCH_SIZE);
  const results = await Promise.all(
    batch.map(async url => {
      const response = await axios.get(url);
      const $ = cheerio.load(response.data);
      const books = [];
      $("article.product_pod").each((_, el) => {
        books.push({
          title: $(el).find("h3 a").attr("title"),
          price: $(el).find(".price_color").text(),
        });
      });
      return books;
    })
  );

  allBooks.push(...results.flat());
  console.log(`Batch done. Total: ${allBooks.length} books`);
  await delay(1000); // Pause between batches
}

Five concurrent requests per batch, with a pause between batches. This scrapes all 50 pages of Books to Scrape in about 10 seconds instead of 50.

Three Real-World Node.js Scraping Projects

Theory is useful. Working projects are better. Here are three scrapers you can run right now.

Project 1: Price Monitoring Bot

Track book prices on Books to Scrape and save the results with timestamps:

import Browserbeam from "@browserbeam/sdk";
import { writeFileSync, existsSync, readFileSync } from "fs";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });

async function checkPrices() {
  const session = await client.sessions.create({
    url: "https://books.toscrape.com",
  });

  await session.extract({
    books: [{
      _parent: "article.product_pod",
      title: "h3 a >> text",
      price: ".price_color >> text",
      inStock: ".instock.availability >> text",
    }],
  });

  const snapshot = {
    timestamp: new Date().toISOString(),
    books: session.extraction.books,
  };

  const historyFile = "output/price-history.json";
  const history = existsSync(historyFile)
    ? JSON.parse(readFileSync(historyFile, "utf-8"))
    : [];

  history.push(snapshot);
  writeFileSync(historyFile, JSON.stringify(history, null, 2));
  console.log(`Tracked ${snapshot.books.length} prices at ${snapshot.timestamp}`);

  await session.close();
}

await checkPrices();

Run this on a cron schedule and you've got a price tracker. Compare timestamps to detect price drops.
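
A minimal sketch of that comparison, assuming output/price-history.json already holds at least two snapshots from the scraper above:

import { readFileSync } from "fs";

const history = JSON.parse(readFileSync("output/price-history.json", "utf-8"));
if (history.length >= 2) {
  const [previous, latest] = history.slice(-2);
  const previousPrices = new Map(previous.books.map(b => [b.title, b.price]));

  // Flag any book whose price string changed between the two snapshots
  const changed = latest.books.filter(b => {
    const old = previousPrices.get(b.title);
    return old && old !== b.price;
  });

  console.log(`${changed.length} price changes since ${previous.timestamp}`);
}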

Project 2: Job Board Aggregator

Scrape job listings from Fake Python Jobs:

import axios from "axios";
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const response = await axios.get("https://realpython.github.io/fake-jobs/");
const $ = cheerio.load(response.data);

const jobs = [];
$(".card").each((i, el) => {
  jobs.push({
    title: $(el).find("h2.title").text().trim(),
    company: $(el).find("h3.company").text().trim(),
    location: $(el).find(".location").text().trim(),
    date: $(el).find("time").text().trim(),
    link: $(el).find("a[href*='jobs']").last().attr("href"),
  });
});

writeFileSync("output/jobs.json", JSON.stringify(jobs, null, 2));
console.log(`Found ${jobs.length} job listings`);

This is a static HTML site, so Cheerio handles it in under a second. Add filtering by title or location to build a job alert system.
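
For example, a simple keyword filter over the scraped listings (the keywords here are just placeholders; swap in whatever you're hunting for):

// Keep only listings whose title or location matches your alert criteria
const matches = jobs.filter(job =>
  /engineer|developer/i.test(job.title) || /remote/i.test(job.location)
);

console.log(`${matches.length} of ${jobs.length} listings match the alert filters`);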

Project 3: News Headline Tracker

Scrape the front page of Hacker News for top stories:

import axios from "axios";
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const response = await axios.get("https://news.ycombinator.com");
const $ = cheerio.load(response.data);

const stories = [];
$(".athing").each((i, el) => {
  const rank = $(el).find(".rank").text().replace(".", "");
  const titleEl = $(el).find(".titleline > a");
  const headline = titleEl.text();
  const url = titleEl.attr("href");

  stories.push({ rank: parseInt(rank), headline, url });
});

writeFileSync("output/hackernews.json", JSON.stringify(stories, null, 2));
console.log(`Scraped ${stories.length} headlines from Hacker News`);

Hacker News is server-rendered HTML, so Cheerio works perfectly. Run this hourly to track trending topics, or compare snapshots to find stories that climbed the rankings.
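
To spot climbers, save each run to a timestamped file and diff the two most recent snapshots. A rough sketch, assuming two such files exist (the filenames below are placeholders from hypothetical earlier runs):

import { readFileSync } from "fs";

// Hypothetical snapshot files written by two earlier runs of the scraper
const earlier = JSON.parse(readFileSync("output/hackernews-0900.json", "utf-8"));
const later = JSON.parse(readFileSync("output/hackernews-1000.json", "utf-8"));

const earlierRanks = new Map(earlier.map(s => [s.headline, s.rank]));

// A story "climbed" if its rank number dropped between snapshots
const climbers = later.filter(s => {
  const oldRank = earlierRanks.get(s.headline);
  return oldRank !== undefined && s.rank < oldRank;
});

console.log(climbers.map(s => `${s.headline} (${earlierRanks.get(s.headline)} → ${s.rank})`));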

Common Node.js Web Scraping Mistakes

These are the five pitfalls that trip up most Node.js scrapers. Each one has burned me at least once.

Mistake 1: Ignoring robots.txt

The mistake: Scraping a site without checking what's allowed.

Why it matters: robots.txt tells crawlers which paths are off-limits. Ignoring it can get your IP blocked, and for some sites, it creates legal exposure.

The fix: Check robots.txt before scraping. It's always at the root: https://example.com/robots.txt. Respect Disallow directives and Crawl-delay values. Libraries like robots-parser can parse the file programmatically.
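
A quick sketch using the robots-parser package (npm install robots-parser), which exposes isAllowed and getCrawlDelay checks; the user agent string here is just an example:

import axios from "axios";
import robotsParser from "robots-parser";

const robotsUrl = "https://news.ycombinator.com/robots.txt";
const { data } = await axios.get(robotsUrl);
const robots = robotsParser(robotsUrl, data);

// Check a specific URL before fetching it
console.log(robots.isAllowed("https://news.ycombinator.com/", "my-scraper"));
console.log(robots.getCrawlDelay("my-scraper")); // undefined if no Crawl-delay directive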

Mistake 2: Not Handling Async Properly

The mistake: Mixing callbacks, promises, and async/await inconsistently, or forgetting to await a promise.

Why it matters: Unawaited promises fail silently. Your scraper finishes with empty results and no error message. This is the most common debugging headache in Node.js scraping.

The fix: Use async/await everywhere. Never mix patterns. If a function returns a promise, await it. Add try/catch around async operations. Enable the no-floating-promises ESLint rule if you're using TypeScript.

// Bad: missing await, scraper ends before data arrives
const data = fetchPage(url); // Returns a Promise, not data

// Good: await the promise
const data = await fetchPage(url);
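
To enforce the rule mentioned above, a minimal classic (.eslintrc-style) config might look like this. It assumes the @typescript-eslint parser and plugin are installed; the rule needs type information, hence the project option, and newer ESLint versions use flat config instead, though the rule name is the same:

// .eslintrc.cjs
module.exports = {
  parser: "@typescript-eslint/parser",
  parserOptions: { project: "./tsconfig.json" },
  plugins: ["@typescript-eslint"],
  rules: {
    // Flags any promise that is neither awaited nor explicitly handled
    "@typescript-eslint/no-floating-promises": "error",
  },
};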

Mistake 3: Skipping Error Handling and Retries

The mistake: Assuming every HTTP request will succeed on the first try.

Why it matters: Networks are unreliable. Servers return 429 (rate limited), 503 (overloaded), or just time out. Without retry logic, one failed request kills your entire scrape.

The fix: Wrap requests in a retry function with exponential backoff. See the fetchWithRetry function in the Advanced Patterns section above.

Mistake 4: Hardcoding Selectors Without Validation

The mistake: Using .querySelector(".product-price").textContent without checking if the element exists.

Why it matters: Sites change their markup. A selector that worked yesterday returns null today, and null.textContent crashes your scraper.

The fix: Always check for null before accessing properties. Use optional chaining (?.) in page.evaluate() calls. With Cheerio, check .length before calling .text().

// Risky: Cheerio silently returns "" if the element is missing, hiding the problem
const price = $(el).find(".price_color").text();

// Better: check .length so missing data becomes an explicit null
const priceEl = $(el).find(".price_color");
const price = priceEl.length ? priceEl.text() : null;

Mistake 5: Not Closing Browser Instances

The mistake: Launching Puppeteer or Playwright browsers without closing them when done or when errors occur.

Why it matters: Unclosed browsers leak memory. Run a scraper in a loop and you'll have dozens of zombie Chromium processes consuming all available RAM within minutes.

The fix: Always close browsers in a finally block:

const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  // ... scraping logic
} finally {
  await browser.close();
}

With Browserbeam, call session.close() when you're done. The cloud handles the browser cleanup, but closing the session frees resources and stops the billing clock.

How to Choose the Right Node.js Scraping Library

Picking the right tool depends on three factors: does the page use JavaScript, how many pages are you scraping, and where does your scraper run?

Decision Framework: Which Node.js Scraping Tool?

Does the page render content with JavaScript?

  • NO → Static HTML → Axios + Cheerio
  • YES → You need a browser. Do you want to manage browser infrastructure yourself?
      • YES → Puppeteer or Playwright
      • NO → Browserbeam

When to Use Axios + Cheerio

Choose Axios + Cheerio when:
- The target site serves all content in the initial HTML (no client-side rendering)
- You need maximum speed (HTML parsing is 10-100x faster than browser rendering)
- You're scraping thousands of pages and want to minimize resource usage
- You don't need to interact with the page (clicking, scrolling, form filling)

Good fit: documentation sites, blogs, wikis, static e-commerce sites, government databases.

When to Use Puppeteer or Playwright

Choose a local browser when:
- The page loads content with JavaScript (React, Vue, Angular, AJAX calls)
- You need to interact with the page (click buttons, fill forms, scroll)
- You need screenshots or PDF generation
- You're comfortable managing browser infrastructure (Docker, crash handling, memory limits)
- You need raw CDP access for low-level browser control

Puppeteer vs Playwright: If you only need Chrome, Puppeteer is simpler. If you need multi-browser support, auto-wait, or browser contexts, choose Playwright. For a deeper comparison, see our Puppeteer vs Playwright vs Browserbeam analysis.

When to Use a Cloud Browser API

Choose Browserbeam when:
- You want structured output (markdown, JSON extraction) instead of raw HTML
- You don't want to manage browser infrastructure in production
- You need built-in proxy rotation and cookie banner dismissal
- You're building AI agents that consume web data (token-efficient structured output matters)
- You want to skip the Docker, Chromium downloads, and crash-handling boilerplate

Bad fit: offline development, sub-100ms latency requirements, raw CDP protocol access.

Scenario | Best Tool | Why
Scrape 10,000 static product pages | Axios + Cheerio | Fastest, lowest resource usage
Scrape a React SPA with infinite scroll | Puppeteer or Browserbeam | Needs JavaScript execution + scrolling
Build a price monitoring pipeline | Browserbeam | No infra, structured output, proxy built-in
Generate PDFs from web pages | Puppeteer or Playwright | Local control over rendering
Feed web data to an LLM agent | Browserbeam | Token-efficient structured output
Scrape across Chrome, Firefox, Safari | Playwright | Multi-browser from one API

Web Scraping: JavaScript vs Python

This is one of the most common questions, and the answer depends on your existing stack and what you're building.

Factor | JavaScript / Node.js | Python
Async scraping | Native async/await, event loop | asyncio or threading (extra setup)
Static HTML parsing | Cheerio (jQuery-like) | BeautifulSoup (Pythonic)
Browser automation | Puppeteer, Playwright | Selenium, Playwright, Browserbeam
Data analysis | Limited (no pandas equivalent) | pandas, numpy, Jupyter
Package registry | npm (2M+ packages) | PyPI (500K+ packages)
TypeScript | First-class | Type hints (optional, less enforced)
Learning curve for scraping | Easy if you know JS | Easy if you know Python
Community resources | Growing | Largest, most tutorials

Choose JavaScript when:
- Your team already works in JavaScript/TypeScript
- You're scraping JavaScript-heavy sites (you understand the DOM natively)
- You want to share code between your web app and your scrapers
- You need TypeScript's type system for complex data shapes
- Your scraping pipeline feeds into a Node.js backend

Choose Python when:
- You need to analyze scraped data (pandas, Jupyter, matplotlib)
- You're building ML pipelines that consume web data
- You want the largest selection of scraping tutorials and community support
- Your team is already Python-first

The honest answer: both languages scrape equally well. The best choice is the one your team already knows. If you're starting fresh and your only goal is scraping, Python has a slight edge in community resources. If you're a JavaScript developer, there's no reason to learn Python just for scraping. The Node.js tools are mature and capable.

For a full Python walkthrough, see our web scraping with Python guide.

Node.js Web Scraping Best Practices

Practice | Why It Matters
Check robots.txt before scraping | Avoids blocks and legal issues
Set a User-Agent header | Many sites block requests without one
Add delays between requests (1-3 seconds) | Prevents rate limiting and IP bans
Use try/catch and retry logic | Networks fail; scrapers shouldn't
Close browser instances in finally blocks | Prevents memory leaks
Validate selectors before extracting | Sites change markup without warning
Export data incrementally | Don't lose hours of work to a crash at page 999
Use TypeScript for complex scrapers | Catches data shape errors at compile time
Rotate proxies for large-scale scraping | Distributes requests across IPs
Log progress and errors | Makes debugging failed scrapes possible

Frequently Asked Questions

How do I scrape a website with Node.js?

Install Axios and Cheerio for static HTML pages, or Puppeteer/Playwright for JavaScript-rendered sites. Fetch the page, parse the HTML with CSS selectors, and extract the data you need. See the step-by-step tutorial in this guide for a working example using Books to Scrape.

Is JavaScript good for web scraping?

Yes. Node.js has a mature scraping toolchain (Cheerio, Puppeteer, Playwright, Browserbeam), native async/await for concurrent requests, and the advantage of running the same language as the pages you're scraping. It's a strong choice for any team already working in JavaScript or TypeScript.

What is the best Node.js web scraping library?

It depends on your use case. Axios + Cheerio is best for static HTML. Puppeteer or Playwright is best for JavaScript-rendered pages. Browserbeam is best for production scraping without browser management. See the comparison table and decision framework in this guide.

Can I scrape JavaScript-rendered pages with Node.js?

Yes. Use Puppeteer, Playwright, or Browserbeam. All three execute JavaScript and give you access to the fully rendered DOM. Cheerio and Axios alone cannot scrape JavaScript-rendered content because they don't execute client-side scripts.

How do I build a web crawler in Node.js?

Start with a seed URL, scrape the page for data and links, add new links to a queue, and repeat. Use a Set to track visited URLs and avoid cycles. Add rate limiting and a maximum depth to prevent runaway crawling. The pagination examples in this guide show the basic pattern.
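
A minimal sketch of that pattern with Axios and Cheerio, restricted to one site, using a fixed page budget instead of a depth limit and omitting rate limiting for brevity:

import axios from "axios";
import * as cheerio from "cheerio";

const seed = "https://books.toscrape.com/";
const queue = [seed];
const visited = new Set();
const MAX_PAGES = 20;

while (queue.length && visited.size < MAX_PAGES) {
  const url = queue.shift();
  if (visited.has(url)) continue;
  visited.add(url);

  const { data } = await axios.get(url);
  const $ = cheerio.load(data);

  // ...extract whatever data you need from $ here...

  // Queue same-site links we haven't seen yet
  $("a[href]").each((_, el) => {
    const next = new URL($(el).attr("href"), url).href;
    if (next.startsWith(seed) && !visited.has(next)) queue.push(next);
  });
}

console.log(`Crawled ${visited.size} pages`);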

What is the difference between Cheerio and Puppeteer?

Cheerio parses static HTML strings with a jQuery API. It's fast and lightweight but doesn't execute JavaScript. Puppeteer controls a real Chromium browser that renders pages fully, including JavaScript. Use Cheerio for static sites and Puppeteer for dynamic ones.

Can I use TypeScript for web scraping?

Yes. All major Node.js scraping libraries (Cheerio, Puppeteer, Playwright, Browserbeam SDK) ship with TypeScript declarations. TypeScript adds type safety to your scraped data shapes, catches selector errors at compile time, and improves code maintainability for complex scraping projects.

Is web scraping with JavaScript faster than Python?

For HTTP-based scraping (Axios + Cheerio vs Requests + BeautifulSoup), performance is comparable. Node.js has an edge in concurrent scraping because its event loop handles thousands of simultaneous connections without threading overhead. For browser-based scraping (Puppeteer vs Selenium), performance depends more on the browser engine than the language. The real bottleneck is always the network and the target server's response time.

Conclusion

You've now got the full Node.js scraping toolkit. Axios + Cheerio for static pages. Puppeteer and Playwright for JavaScript-heavy sites. Browserbeam for production scraping without infrastructure headaches. Plus pagination, proxies, retries, and concurrent scraping patterns.

The best way to learn is to pick one of the three projects above and customize it. Change the target site. Adjust the selectors. Add error handling. Break things and fix them.

If you want to skip the browser management and get structured data from any web page with a single API call, try Browserbeam for free. The Node.js SDK installs in seconds and the API documentation covers every endpoint.

For more scraping guides, check out our structured web scraping guide and web scraping in 2026 overview. If you're scaling to thousands of pages, see scaling web automation.

What will you scrape first?
