Web Scraping with Node.js: Libraries, Tools and Code Examples

May 08, 2026 · 20 min read

Node.js web scraping tools: Cheerio, Puppeteer, Playwright, Axios, and Browserbeam with the Node.js logo

By the end of this guide, you'll have working Node.js scrapers that fetch pages, parse HTML, extract structured data, handle JavaScript-rendered content, and paginate through entire sites. We'll build each one from scratch, using real websites and real selectors you can copy and run.

Node.js has a deep scraping toolchain. Axios and Cheerio handle static HTML. Puppeteer and Playwright spin up real browsers for JavaScript-heavy pages. Cloud browser APIs like Browserbeam let you skip the browser management entirely and get structured data back from a single API call. The challenge isn't finding libraries. It's knowing which one fits your use case and how to avoid the async pitfalls that trip up most Node.js scrapers.

This guide covers the full range: from a five-line Cheerio script to production-grade scrapers with proxies, retries, and concurrent page fetching. Every code example runs against real, publicly accessible websites. Copy, paste, run.

What you'll learn in this guide:

  • How to fetch and parse HTML with Axios and Cheerio
  • How to scrape JavaScript-rendered pages with Puppeteer, Playwright, and Browserbeam
  • How to extract structured data using CSS selectors and extraction schemas
  • How to handle pagination, infinite scroll, and multi-page scraping
  • How to use proxies, manage cookies, and implement retries
  • Three real-world scraping projects you can build today
  • How to choose the right Node.js scraping library for your use case

TL;DR: Node.js web scraping starts with Axios + Cheerio for static HTML, but most modern sites need a real browser. Puppeteer and Playwright handle JavaScript rendering locally. Cloud browser APIs like Browserbeam handle it remotely with structured output and no browser management. This guide walks through both approaches with working code, then covers pagination, proxies, concurrency, and three real-world projects.


What Is Web Scraping and Why Node.js?

Web scraping is how you turn web pages into structured data. A script fetches a page, finds the elements you care about, and pulls out the text, links, or attributes. Prices, job listings, headlines, product reviews. If it's on a web page, you can scrape it.

The basic loop is always the same: fetch the page, parse the HTML, extract the target data. Everything else (pagination, JavaScript rendering, proxy rotation) builds on top of that.

How Web Scraping Works

Every web page is an HTML document. Your scraper sends an HTTP request, receives that HTML, and uses a parser to locate specific elements. CSS selectors tell the parser which elements contain the data you want. The scraper reads the text or attributes from those elements and saves the result.

The wrinkle: modern websites load data with JavaScript after the initial HTML arrives. For those sites, a simple HTTP request returns an empty shell. You need a tool that executes JavaScript, which means a headless browser or a browser API.

For a deeper walkthrough of scraping fundamentals, see our Python web scraping guide. The concepts are identical across languages.

Why Node.js for Web Scraping

Node.js runs JavaScript natively, which gives it a unique advantage when scraping JavaScript-heavy sites. You're writing in the same language the page uses. Debugging selectors, understanding DOM APIs, and working with JSON responses all feel natural.

The async I/O model is a strong fit for scraping. While one request waits for a response, your scraper can fire off others. Promise.all and async/await make concurrent scraping straightforward without threads or multiprocessing.
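
Here's a minimal sketch of that pattern (the concurrency section later in this guide expands on it): three catalogue pages fetched in parallel with a single await.

import axios from "axios";

const urls = [
  "https://books.toscrape.com/catalogue/page-1.html",
  "https://books.toscrape.com/catalogue/page-2.html",
  "https://books.toscrape.com/catalogue/page-3.html",
];

// All three requests are in flight at once; await resolves when every one finishes
const responses = await Promise.all(urls.map(url => axios.get(url)));
console.log(responses.map(r => r.data.length)); // HTML length of each page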

The npm registry has over 2 million packages. The scraping-specific ones (Cheerio, Puppeteer, Playwright, got-scraping) are mature, well-maintained, and widely used. TypeScript support is first-class across all of them, which catches selector typos and data shape mismatches at compile time.

If your stack is already JavaScript or TypeScript, there's no reason to context-switch to another language for scraping. The tools are here, and they're good.

Node.js Web Scraping Libraries Compared

Before we write any code, let's look at the options. Each library fills a different niche, and picking the wrong one wastes time.

Axios + Cheerio

Axios handles HTTP requests. Cheerio parses the HTML response using a jQuery-like API. Together, they're the fastest way to scrape static HTML pages.

Strengths:
- Lightweight. No browser, no Chromium download. Installs in seconds.
- Fast. Parsing HTML in memory is orders of magnitude faster than launching a browser.
- Familiar API. If you've used jQuery, you already know Cheerio.

Limitations:
- No JavaScript execution. Pages that render content client-side return empty HTML.
- No cookie/session management built in (you handle it manually with Axios interceptors).
- No interaction. You can't click buttons, fill forms, or scroll.

Puppeteer

Puppeteer is Google's Node.js library for controlling Chrome/Chromium. It launches a real browser, navigates to pages, executes JavaScript, and gives you access to the rendered DOM.

Strengths:
- Full JavaScript execution. Every page renders exactly as a user would see it.
- Rich API for interaction: clicking, typing, screenshotting, PDF generation.
- Battle-tested. Used in production by thousands of teams since 2017.

Limitations:
- Downloads Chromium (~300MB) on install.
- Resource-heavy. Each browser instance uses 200-500MB of RAM.
- Chrome-only (Chromium, technically). No Firefox or WebKit support.
- You manage the browser lifecycle: launching, crashing, zombie processes.

Playwright

Playwright is Microsoft's multi-browser automation library. It supports Chromium, Firefox, and WebKit from a single API.

Strengths:
- Multi-browser support. Test and scrape across Chrome, Firefox, and Safari engines.
- Auto-wait. Built-in smart waiting that reduces flaky selectors.
- Browser contexts for isolated sessions without full browser restarts.
- Active development and strong TypeScript types.

Limitations:
- Downloads all three browser engines (~600MB+) by default.
- Resource-heavy, same as Puppeteer.
- More complex API surface than Puppeteer for simple scraping tasks.
- You still manage browser infrastructure in production.

Browserbeam

Browserbeam is a cloud browser API. You send HTTP requests, Browserbeam runs the browser remotely, and returns structured data. No browser to install, no infrastructure to manage.

Strengths:
- No local browser. No Chromium downloads, no Docker containers, no crash handling.
- Structured output. Get markdown, page maps, and extraction results instead of raw HTML.
- Built-in stability detection. The API waits for the page to finish loading automatically.
- Auto-dismiss blockers. Cookie banners and popups handled for you.
- Extraction schemas. Describe the data shape you want, get JSON back.
- Proxy support (datacenter and residential) built into the API.

Limitations:
- Requires internet connectivity. Not for offline development.
- API latency is higher than local browser calls (~2-4 seconds per request vs sub-second locally).
- Newer product with a smaller community than Puppeteer or Playwright.
- No raw CDP access (by design; the API abstracts the browser).

Library Comparison

Feature | Axios + Cheerio | Puppeteer | Playwright | Browserbeam
JavaScript execution | No | Yes | Yes | Yes
Browser download | None | ~300MB | ~600MB | None
RAM per instance | ~20MB | 200-500MB | 200-500MB | 0 (cloud)
Output format | Raw HTML | Raw HTML + DOM | Raw HTML + DOM | Structured markdown + JSON
Auto-wait / stability | No | Manual | Built-in | Built-in
Multi-browser | N/A | Chrome only | Chrome, Firefox, WebKit | Chrome (cloud)
Proxy support | Manual | Manual | Manual | Built-in
Cookie banner handling | Manual | Manual | Manual | Automatic
TypeScript types | Yes | Yes | Yes | Yes
Best for | Static HTML pages | JS-heavy scraping, local | Multi-browser, testing | AI agents, production scraping

Setting Up Your Node.js Scraping Environment

Let's set up a project. We'll install the core libraries, create a clean project structure, and optionally add TypeScript.

Installing Core Libraries

Start with a new project directory and initialize it. We'll use ES modules (import syntax) and top-level await, so set "type": "module" in your package.json:

mkdir node-scraper && cd node-scraper
npm init -y
npm pkg set type=module
npm install axios cheerio

For browser-based scraping, add one of these:

# Puppeteer (downloads Chromium automatically)
npm install puppeteer

# OR Playwright (then download its browser binaries)
npm install playwright
npx playwright install

# OR Browserbeam SDK (no browser download)
npm install @browserbeam/sdk

Project Structure for Scraping Scripts

Keep things simple. One file per scraper, a shared utilities module, and an output directory:

node-scraper/
├── package.json
├── scrapers/
│   ├── books.js          # Static HTML scraper
│   ├── quotes.js         # JS-rendered scraper
│   └── headlines.js      # News scraper
├── utils/
│   ├── retry.js          # Retry logic
│   └── export.js         # JSON/CSV export helpers
└── output/               # Scraped data goes here

TypeScript Setup

TypeScript catches selector typos and data shape bugs at compile time. If you're building anything beyond a quick throwaway script, it's worth the five-minute setup.

npm install -D typescript @types/node
npx tsc --init

Set "module": "nodenext" and "moduleResolution": "nodenext" in your tsconfig.json. All the libraries we're using ship with TypeScript declarations, so you get autocomplete and type checking out of the box.

Your First Node.js Web Scraper: Step by Step

Let's build a scraper that extracts book titles, prices, and availability from Books to Scrape. This is a static HTML site, so Axios + Cheerio is all we need.

Step 1: Fetch a Web Page with Axios

import axios from "axios";

const response = await axios.get("https://books.toscrape.com");
console.log(response.status); // 200
console.log(response.data.length); // ~50000 characters of HTML

Axios returns the full HTML as a string in response.data. For static sites, this is all the fetching you need.

Step 2: Parse HTML with Cheerio

import * as cheerio from "cheerio";

const $ = cheerio.load(response.data);

The $ function works like jQuery. You pass CSS selectors and get back a Cheerio object with methods to extract text, attributes, and HTML.

Step 3: Extract Data with CSS Selectors

const books = [];

$("article.product_pod").each((index, element) => {
  const title = $(element).find("h3 a").attr("title");
  const price = $(element).find(".price_color").text();
  const inStock = $(element).find(".instock.availability").text().trim();
  const url = $(element).find("h3 a").attr("href");

  books.push({ title, price, inStock, url });
});

console.log(books.length); // 20
console.log(books[0]);
// { title: "A Light in the Attic", price: "£51.77", inStock: "In stock", url: "catalogue/a-light-in-the-..." }

Each article.product_pod on the page represents one book. We use .find() to locate child elements and .text() or .attr() to pull out the data.

Step 4: Export to JSON or CSV

import { writeFileSync } from "fs";

// JSON
writeFileSync("output/books.json", JSON.stringify(books, null, 2));

// CSV
const header = "title,price,inStock,url";
const escapeCsv = value => `"${String(value ?? "").replace(/"/g, '""')}"`; // escape embedded quotes
const rows = books.map(b =>
  [b.title, b.price, b.inStock, b.url].map(escapeCsv).join(",")
);
writeFileSync("output/books.csv", [header, ...rows].join("\n"));

Here's the complete scraper in one file:

import axios from "axios";
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const response = await axios.get("https://books.toscrape.com");
const $ = cheerio.load(response.data);

const books = [];
$("article.product_pod").each((index, element) => {
  books.push({
    title: $(element).find("h3 a").attr("title"),
    price: $(element).find(".price_color").text(),
    inStock: $(element).find(".instock.availability").text().trim(),
    url: $(element).find("h3 a").attr("href"),
  });
});

writeFileSync("output/books.json", JSON.stringify(books, null, 2));
console.log(`Scraped ${books.length} books`);

Twenty lines. That's the baseline. Everything from here builds on this pattern.

Scraping JavaScript-Heavy Sites with Node.js

The Cheerio approach works until it doesn't. Open the JavaScript version of Quotes to Scrape (quotes.toscrape.com/js/) in your browser and you'll see quotes. Fetch it with Axios and the quotes aren't there. They're loaded by JavaScript after the initial HTML arrives.

Why Static Scrapers Fail on Modern Sites

When Axios fetches a page, it downloads the raw HTML. No JavaScript runs. If the page uses React, Vue, Angular, or any client-side rendering, the data simply isn't in the HTML response. You'll get a <div id="root"></div> and nothing else.

The fix: use a tool that executes JavaScript. That means a headless browser (Puppeteer, Playwright) or a cloud browser API (Browserbeam).
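
You can see the failure for yourself: point the Cheerio approach from earlier at the JavaScript version of Quotes to Scrape and it finds nothing, because the .quote elements only exist after client-side scripts run.

import axios from "axios";
import * as cheerio from "cheerio";

const response = await axios.get("https://quotes.toscrape.com/js/");
const $ = cheerio.load(response.data);

// The quotes are injected by JavaScript, so the raw HTML contains none of them
console.log($(".quote").length); // 0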

Scraping with Puppeteer

import puppeteer from "puppeteer";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.goto("https://quotes.toscrape.com/js/", {
  waitUntil: "networkidle2",
});

const quotes = await page.evaluate(() => {
  return Array.from(document.querySelectorAll(".quote")).map(el => ({
    text: el.querySelector(".text")?.textContent,
    author: el.querySelector(".author")?.textContent,
  }));
});

console.log(quotes.length); // 10
console.log(quotes[0]);
// { text: ""The world as we have created it...", author: "Albert Einstein" }

await browser.close();

Puppeteer launches a real Chromium instance, waits for the page to finish loading, then runs page.evaluate() to extract data from the live DOM. The waitUntil: "networkidle2" option tells Puppeteer to wait until there are no more than 2 active network connections for 500ms.

Scraping with Playwright

import { chromium } from "playwright";

const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();

await page.goto("https://quotes.toscrape.com/js/");
await page.waitForSelector(".quote");

const quotes = await page.evaluate(() => {
  return Array.from(document.querySelectorAll(".quote")).map(el => ({
    text: el.querySelector(".text")?.textContent,
    author: el.querySelector(".author")?.textContent,
  }));
});

console.log(quotes.length); // 10
await browser.close();

Playwright's waitForSelector is often more reliable than Puppeteer's waitUntil options. It waits until the specific element appears in the DOM, which is exactly what you want when scraping dynamically loaded content.

Scraping with a Cloud Browser API

With Browserbeam, there is no browser to launch or manage. You send an API request and get structured data back.

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });
const session = await client.sessions.create({
  url: "https://quotes.toscrape.com/js/",
});

// The page is already loaded and stable. Extract structured data:
await session.extract({
  quotes: [{ _parent: ".quote", text: ".text >> text", author: ".author >> text" }],
});

console.log(session.extraction.quotes);
// [{ text: "The world as we have created it...", author: "Albert Einstein" }, ...]

await session.close();

No Chromium download. No waitUntil guessing. No page.evaluate(). The API handles JavaScript execution, stability detection, and data extraction. You describe the data shape you want, and get JSON back.

The tradeoff is latency (API call vs local browser) and cost (Browserbeam uses a credit system). For production scrapers that need to run reliably without infrastructure management, the tradeoff usually makes sense.

Approach | Lines of code | Setup time | JS rendering | Output format
Axios + Cheerio | 15 | Seconds | No | Raw HTML
Puppeteer | 20 | Minutes (Chromium download) | Yes | Raw DOM
Playwright | 20 | Minutes (browser download) | Yes | Raw DOM
Browserbeam | 10 | Seconds (API key) | Yes | Structured JSON

Handling Pagination, Multiple Pages, and Infinite Scroll

A single page rarely has all the data you need. Let's handle the three most common multi-page patterns.

Following Next Page Links

Books to Scrape has 50 pages. Each page has a "next" button linking to the next page. Here's how to follow them all:

import axios from "axios";
import * as cheerio from "cheerio";

let url = "https://books.toscrape.com";
const allBooks = [];

while (url) {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  $("article.product_pod").each((i, el) => {
    allBooks.push({
      title: $(el).find("h3 a").attr("title"),
      price: $(el).find(".price_color").text(),
    });
  });

  const nextBtn = $("li.next a");
  url = nextBtn.length ? new URL(nextBtn.attr("href"), url).href : null;

  console.log(`Scraped page, total: ${allBooks.length} books`);
}

console.log(`Done. ${allBooks.length} books total.`); // 1000 books

The key pattern: after scraping each page, look for the "next" link. If it exists, build the full URL and continue. If not, stop.

Offset and Cursor-Based Pagination

Some APIs and sites use offset parameters (?page=2) or cursor tokens (?after=abc123). The loop is the same, just with URL parameter manipulation:

let page = 1;
const allItems = [];

while (true) {
  let response;
  try {
    response = await axios.get(`https://books.toscrape.com/catalogue/page-${page}.html`);
  } catch (error) {
    break; // 404 or network error means no more pages
  }

  const $ = cheerio.load(response.data);

  const items = [];
  $("article.product_pod").each((i, el) => {
    items.push({ title: $(el).find("h3 a").attr("title") });
  });

  if (items.length === 0) break;

  allItems.push(...items);
  page++;
}

console.log(`Scraped ${page - 1} pages, ${allItems.length} items`);

Infinite Scroll Pages

Infinite scroll pages load more content as you scroll down. No next button, no page parameter. You need a real browser that can scroll.

With Browserbeam, the scrollCollect method handles this automatically:

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });
const session = await client.sessions.create({
  url: "https://quotes.toscrape.com/scroll",
});

await session.scrollCollect({ max_scrolls: 10, wait_ms: 1000 });

await session.extract({
  quotes: [{ _parent: ".quote", text: ".text >> text", author: ".author >> text" }],
});

console.log(session.extraction.quotes.length); // Up to 100 quotes
await session.close();

scrollCollect scrolls the page repeatedly, waiting for new content to load after each scroll. It stops when no new content appears or it hits the max_scrolls limit. All the lazy-loaded content is in the DOM by the time you call extract.

With Puppeteer, you'd write the scroll loop manually:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://quotes.toscrape.com/scroll");

let previousHeight = 0;
for (let i = 0; i < 10; i++) {
  await page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
  await new Promise(r => setTimeout(r, 1500));

  const newHeight = await page.evaluate("document.body.scrollHeight");
  if (newHeight === previousHeight) break;
  previousHeight = newHeight;
}

const quotes = await page.evaluate(() =>
  Array.from(document.querySelectorAll(".quote")).map(el => ({
    text: el.querySelector(".text")?.textContent,
    author: el.querySelector(".author")?.textContent,
  }))
);

console.log(quotes.length);
await browser.close();

Both work. The Browserbeam version is shorter and doesn't require tuning scroll timing. The Puppeteer version gives you full control over the scroll behavior.

Advanced Node.js Web Scraping Patterns

Once your basic scraper works, these patterns make it production-ready.

Using Proxies for Web Scraping

Proxies rotate your IP address so target sites don't block you after too many requests from the same source.

With Axios, set the proxy in the request config:

const response = await axios.get("https://books.toscrape.com", {
  proxy: {
    host: "proxy.example.com",
    port: 8080,
    auth: { username: "user", password: "pass" },
  },
});

With Browserbeam, proxy support is built into the session:

const session = await client.sessions.create({
  url: "https://books.toscrape.com",
  proxy: { kind: "residential", country: "US" },
});

No proxy provider setup. No IP rotation logic. The API handles it. For more on proxy types and when to use each one, see our guide to residential vs datacenter proxies.

Managing Cookies and Sessions

Some sites require login or track state with cookies. The typical pattern: fetch the login page to get a CSRF token, submit credentials with that token, then forward the session cookie on subsequent requests.

import axios from "axios";
import * as cheerio from "cheerio";

// Step 1: Fetch the login page to get the CSRF token and session cookie
const loginPage = await axios.get("https://quotes.toscrape.com/login");
const $ = cheerio.load(loginPage.data);
const csrfToken = $('input[name="csrf_token"]').val();
const initialCookies = loginPage.headers["set-cookie"];

// Step 2: Submit credentials with the CSRF token
const loginResponse = await axios.post(
  "https://quotes.toscrape.com/login",
  `csrf_token=${csrfToken}&username=admin&password=admin`,
  {
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Cookie: initialCookies?.join("; "),
    },
    maxRedirects: 0,
    validateStatus: status => status >= 200 && status < 400,
  }
);

// Step 3: Use the authenticated session cookie for subsequent requests
const sessionCookies = loginResponse.headers["set-cookie"] || initialCookies;
const pageResponse = await axios.get("https://quotes.toscrape.com", {
  headers: { Cookie: sessionCookies?.join("; ") },
});

With Puppeteer or Playwright, cookies persist automatically within a browser context, so you just navigate to the login page, fill the form, and submit. With Browserbeam, sessions maintain state across multiple API calls until you close them.
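
Here's what that looks like with Puppeteer, as a rough sketch against the Quotes to Scrape login form (the #username and #password selectors are what that demo site uses; adjust them for your target):

import puppeteer from "puppeteer";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.goto("https://quotes.toscrape.com/login");
await page.type("#username", "admin");
await page.type("#password", "admin");

// Submit and wait for the redirect; the session cookie now lives in the browser context
await Promise.all([
  page.waitForNavigation(),
  page.click('input[type="submit"]'),
]);

// Subsequent navigations reuse the same cookies automatically
await page.goto("https://quotes.toscrape.com");
await browser.close();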

Rate Limiting and Retries

Hammering a site with hundreds of requests per second will get you blocked. Add a delay between requests and retry on failures:

function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function fetchWithRetry(url, retries = 3, delayMs = 1000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const response = await axios.get(url, { timeout: 10000 });
      return response;
    } catch (error) {
      console.log(`Attempt ${attempt} failed: ${error.message}`);
      if (attempt === retries) throw error;
      await delay(delayMs * 2 ** (attempt - 1)); // exponential backoff: 1s, 2s, 4s, ...
    }
  }
}

// Use it with a delay between pages
for (const url of urls) {
  const response = await fetchWithRetry(url);
  // ... process response
  await delay(1000); // 1 second between requests
}

A good rule of thumb: 1-2 seconds between requests for small sites, 3-5 seconds for larger ones. Always check robots.txt for crawl delay directives.

Concurrent Scraping with Promise.all

When you need speed and the target site can handle it, scrape multiple pages in parallel:

import axios from "axios";
import * as cheerio from "cheerio";

// delay() is the same helper defined in the retry example above
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

const urls = Array.from({ length: 50 }, (_, i) =>
  `https://books.toscrape.com/catalogue/page-${i + 1}.html`
);

const BATCH_SIZE = 5;
const allBooks = [];

for (let i = 0; i < urls.length; i += BATCH_SIZE) {
  const batch = urls.slice(i, i + BATCH_SIZE);
  const results = await Promise.all(
    batch.map(async url => {
      const response = await axios.get(url);
      const $ = cheerio.load(response.data);
      const books = [];
      $("article.product_pod").each((_, el) => {
        books.push({
          title: $(el).find("h3 a").attr("title"),
          price: $(el).find(".price_color").text(),
        });
      });
      return books;
    })
  );

  allBooks.push(...results.flat());
  console.log(`Batch done. Total: ${allBooks.length} books`);
  await delay(1000); // Pause between batches
}

Five concurrent requests per batch, with a pause between batches. This scrapes all 50 pages of Books to Scrape in about 10 seconds instead of 50.

Three Real-World Node.js Scraping Projects

Theory is useful. Working projects are better. Here are three scrapers you can run right now.

Project 1: Price Monitoring Bot

Track book prices on Books to Scrape and save the results with timestamps:

import Browserbeam from "@browserbeam/sdk";
import { writeFileSync, existsSync, readFileSync } from "fs";

const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });

async function checkPrices() {
  const session = await client.sessions.create({
    url: "https://books.toscrape.com",
  });

  await session.extract({
    books: [{
      _parent: "article.product_pod",
      title: "h3 a >> text",
      price: ".price_color >> text",
      inStock: ".instock.availability >> text",
    }],
  });

  const snapshot = {
    timestamp: new Date().toISOString(),
    books: session.extraction.books,
  };

  const historyFile = "output/price-history.json";
  const history = existsSync(historyFile)
    ? JSON.parse(readFileSync(historyFile, "utf-8"))
    : [];

  history.push(snapshot);
  writeFileSync(historyFile, JSON.stringify(history, null, 2));
  console.log(`Tracked ${snapshot.books.length} prices at ${snapshot.timestamp}`);

  await session.close();
}

await checkPrices();

Run this on a cron schedule and you've got a price tracker. Compare timestamps to detect price drops.
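
A minimal sketch of that comparison, assuming output/price-history.json already holds at least two snapshots from the scraper above:

import { readFileSync } from "fs";

const history = JSON.parse(readFileSync("output/price-history.json", "utf-8"));
if (history.length >= 2) {
  const [previous, latest] = history.slice(-2);
  const previousPrices = new Map(previous.books.map(b => [b.title, b.price]));

  // Flag any book whose price string changed between the two snapshots
  const changed = latest.books.filter(b => {
    const old = previousPrices.get(b.title);
    return old && old !== b.price;
  });

  console.log(`${changed.length} price changes since ${previous.timestamp}`);
}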

Project 2: Job Board Aggregator

Scrape job listings from Fake Python Jobs:

import axios from "axios";
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const response = await axios.get("https://realpython.github.io/fake-jobs/");
const $ = cheerio.load(response.data);

const jobs = [];
$(".card").each((i, el) => {
  jobs.push({
    title: $(el).find("h2.title").text().trim(),
    company: $(el).find("h3.company").text().trim(),
    location: $(el).find(".location").text().trim(),
    date: $(el).find("time").text().trim(),
    link: $(el).find("a[href*='jobs']").last().attr("href"),
  });
});

writeFileSync("output/jobs.json", JSON.stringify(jobs, null, 2));
console.log(`Found ${jobs.length} job listings`);

This is a static HTML site, so Cheerio handles it in under a second. Add filtering by title or location to build a job alert system.
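
For example, a simple keyword filter over the scraped listings (the keywords here are just placeholders; swap in whatever you're hunting for):

// Keep only listings whose title or location matches your alert criteria
const matches = jobs.filter(job =>
  /engineer|developer/i.test(job.title) || /remote/i.test(job.location)
);

console.log(`${matches.length} of ${jobs.length} listings match the alert filters`);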

Project 3: News Headline Tracker

Scrape the front page of Hacker News for top stories:

import axios from "axios";
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const response = await axios.get("https://news.ycombinator.com");
const $ = cheerio.load(response.data);

const stories = [];
$(".athing").each((i, el) => {
  const rank = $(el).find(".rank").text().replace(".", "");
  const titleEl = $(el).find(".titleline > a");
  const headline = titleEl.text();
  const url = titleEl.attr("href");

  stories.push({ rank: parseInt(rank), headline, url });
});

writeFileSync("output/hackernews.json", JSON.stringify(stories, null, 2));
console.log(`Scraped ${stories.length} headlines from Hacker News`);

Hacker News is server-rendered HTML, so Cheerio works perfectly. Run this hourly to track trending topics, or compare snapshots to find stories that climbed the rankings.
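
To spot climbers, save each run to a timestamped file and diff the two most recent snapshots. A rough sketch, assuming two such files exist (the filenames below are placeholders from hypothetical earlier runs):

import { readFileSync } from "fs";

// Hypothetical snapshot files written by two earlier runs of the scraper
const earlier = JSON.parse(readFileSync("output/hackernews-0900.json", "utf-8"));
const later = JSON.parse(readFileSync("output/hackernews-1000.json", "utf-8"));

const earlierRanks = new Map(earlier.map(s => [s.headline, s.rank]));

// A story "climbed" if its rank number dropped between snapshots
const climbers = later.filter(s => {
  const oldRank = earlierRanks.get(s.headline);
  return oldRank !== undefined && s.rank < oldRank;
});

console.log(climbers.map(s => `${s.headline} (${earlierRanks.get(s.headline)} → ${s.rank})`));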

Common Node.js Web Scraping Mistakes

These are the five pitfalls that trip up most Node.js scrapers. Each one has burned me at least once.

Mistake 1: Ignoring robots.txt

The mistake: Scraping a site without checking what's allowed.

Why it matters: robots.txt tells crawlers which paths are off-limits. Ignoring it can get your IP blocked, and for some sites, it creates legal exposure.

The fix: Check robots.txt before scraping. It's always at the root: https://example.com/robots.txt. Respect Disallow directives and Crawl-delay values. Libraries like robots-parser can parse the file programmatically.
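
A quick sketch using the robots-parser package (npm install robots-parser), which exposes isAllowed and getCrawlDelay checks; the user agent string here is just an example:

import axios from "axios";
import robotsParser from "robots-parser";

const robotsUrl = "https://news.ycombinator.com/robots.txt";
const { data } = await axios.get(robotsUrl);
const robots = robotsParser(robotsUrl, data);

// Check a specific URL before fetching it
console.log(robots.isAllowed("https://news.ycombinator.com/", "my-scraper"));
console.log(robots.getCrawlDelay("my-scraper")); // undefined if no Crawl-delay directive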

Mistake 2: Not Handling Async Properly

The mistake: Mixing callbacks, promises, and async/await inconsistently, or forgetting to await a promise.

Why it matters: Unawaited promises fail silently. Your scraper finishes with empty results and no error message. This is the most common debugging headache in Node.js scraping.

The fix: Use async/await everywhere. Never mix patterns. If a function returns a promise, await it. Add try/catch around async operations. Enable the no-floating-promises ESLint rule if you're using TypeScript.

// Bad: missing await, scraper ends before data arrives
const data = fetchPage(url); // Returns a Promise, not data

// Good: await the promise
const data = await fetchPage(url);
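
To enforce the rule mentioned above, a minimal classic (.eslintrc-style) config might look like this. It assumes the @typescript-eslint parser and plugin are installed; the rule needs type information, hence the project option, and newer ESLint versions use flat config instead, though the rule name is the same:

// .eslintrc.cjs
module.exports = {
  parser: "@typescript-eslint/parser",
  parserOptions: { project: "./tsconfig.json" },
  plugins: ["@typescript-eslint"],
  rules: {
    // Flags any promise that is neither awaited nor explicitly handled
    "@typescript-eslint/no-floating-promises": "error",
  },
};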

Mistake 3: Skipping Error Handling and Retries

The mistake: Assuming every HTTP request will succeed on the first try.

Why it matters: Networks are unreliable. Servers return 429 (rate limited), 503 (overloaded), or just time out. Without retry logic, one failed request kills your entire scrape.

The fix: Wrap requests in a retry function with exponential backoff. See the fetchWithRetry function in the Advanced Patterns section above.

Mistake 4: Hardcoding Selectors Without Validation

The mistake: Using .querySelector(".product-price").textContent without checking if the element exists.

Why it matters: Sites change their markup. A selector that worked yesterday returns null today, and null.textContent crashes your scraper.

The fix: Always check for null before accessing properties. Use optional chaining (?.) in page.evaluate() calls. With Cheerio, check .length before calling .text().

// Risky: Cheerio silently returns "" if the element is missing, hiding the problem
const price = $(el).find(".price_color").text();

// Better: check .length so missing data becomes an explicit null
const priceEl = $(el).find(".price_color");
const price = priceEl.length ? priceEl.text() : null;

Mistake 5: Not Closing Browser Instances

The mistake: Launching Puppeteer or Playwright browsers without closing them when done or when errors occur.

Why it matters: Unclosed browsers leak memory. Run a scraper in a loop and you'll have dozens of zombie Chromium processes consuming all available RAM within minutes.

The fix: Always close browsers in a finally block:

const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  // ... scraping logic
} finally {
  await browser.close();
}

With Browserbeam, call session.close() when you're done. The cloud handles the browser cleanup, but closing the session frees resources and stops the billing clock.

How to Choose the Right Node.js Scraping Library

Picking the right tool depends on three factors: does the page use JavaScript, how many pages are you scraping, and where does your scraper run?

Decision Framework: Which Node.js Scraping Tool?

Does the page render content with JavaScript?

  • NO → Static HTML → Axios + Cheerio
  • YES → You need a browser. Do you want to manage browser infrastructure yourself?
      • YES → Puppeteer or Playwright
      • NO → Browserbeam

When to Use Axios + Cheerio

Choose Axios + Cheerio when:
- The target site serves all content in the initial HTML (no client-side rendering)
- You need maximum speed (HTML parsing is 10-100x faster than browser rendering)
- You're scraping thousands of pages and want to minimize resource usage
- You don't need to interact with the page (clicking, scrolling, form filling)

Good fit: documentation sites, blogs, wikis, static e-commerce sites, government databases.

When to Use Puppeteer or Playwright

Choose a local browser when:
- The page loads content with JavaScript (React, Vue, Angular, AJAX calls)
- You need to interact with the page (click buttons, fill forms, scroll)
- You need screenshots or PDF generation
- You're comfortable managing browser infrastructure (Docker, crash handling, memory limits)
- You need raw CDP access for low-level browser control

Puppeteer vs Playwright: If you only need Chrome, Puppeteer is simpler. If you need multi-browser support, auto-wait, or browser contexts, choose Playwright. For a deeper comparison, see our Puppeteer vs Playwright vs Browserbeam analysis.

When to Use a Cloud Browser API

Choose Browserbeam when:
- You want structured output (markdown, JSON extraction) instead of raw HTML
- You don't want to manage browser infrastructure in production
- You need built-in proxy rotation and cookie banner dismissal
- You're building AI agents that consume web data (token-efficient structured output matters)
- You want to skip the Docker, Chromium downloads, and crash-handling boilerplate

Bad fit: offline development, sub-100ms latency requirements, raw CDP protocol access.

Scenario | Best Tool | Why
Scrape 10,000 static product pages | Axios + Cheerio | Fastest, lowest resource usage
Scrape a React SPA with infinite scroll | Puppeteer or Browserbeam | Needs JavaScript execution + scrolling
Build a price monitoring pipeline | Browserbeam | No infra, structured output, proxy built-in
Generate PDFs from web pages | Puppeteer or Playwright | Local control over rendering
Feed web data to an LLM agent | Browserbeam | Token-efficient structured output
Scrape across Chrome, Firefox, Safari | Playwright | Multi-browser from one API

Web Scraping: JavaScript vs Python

This is one of the most common questions, and the answer depends on your existing stack and what you're building.

Factor | JavaScript / Node.js | Python
Async scraping | Native async/await, event loop | asyncio or threading (extra setup)
Static HTML parsing | Cheerio (jQuery-like) | BeautifulSoup (Pythonic)
Browser automation | Puppeteer, Playwright | Selenium, Playwright, Browserbeam
Data analysis | Limited (no pandas equivalent) | pandas, numpy, Jupyter
Package registry | npm (2M+ packages) | PyPI (500K+ packages)
TypeScript | First-class | Type hints (optional, less enforced)
Learning curve for scraping | Easy if you know JS | Easy if you know Python
Community resources | Growing | Largest, most tutorials

Choose JavaScript when:
- Your team already works in JavaScript/TypeScript
- You're scraping JavaScript-heavy sites (you understand the DOM natively)
- You want to share code between your web app and your scrapers
- You need TypeScript's type system for complex data shapes
- Your scraping pipeline feeds into a Node.js backend

Choose Python when:
- You need to analyze scraped data (pandas, Jupyter, matplotlib)
- You're building ML pipelines that consume web data
- You want the largest selection of scraping tutorials and community support
- Your team is already Python-first

The honest answer: both languages scrape equally well. The best choice is the one your team already knows. If you're starting fresh and your only goal is scraping, Python has a slight edge in community resources. If you're a JavaScript developer, there's no reason to learn Python just for scraping. The Node.js tools are mature and capable.

For a full Python walkthrough, see our web scraping with Python guide.

Node.js Web Scraping Best Practices

Practice | Why It Matters
Check robots.txt before scraping | Avoids blocks and legal issues
Set a User-Agent header | Many sites block requests without one
Add delays between requests (1-3 seconds) | Prevents rate limiting and IP bans
Use try/catch and retry logic | Networks fail; scrapers shouldn't
Close browser instances in finally blocks | Prevents memory leaks
Validate selectors before extracting | Sites change markup without warning
Export data incrementally | Don't lose hours of work to a crash at page 999
Use TypeScript for complex scrapers | Catches data shape errors at compile time
Rotate proxies for large-scale scraping | Distributes requests across IPs
Log progress and errors | Makes debugging failed scrapes possible

Frequently Asked Questions

How do I scrape a website with Node.js?

Install Axios and Cheerio for static HTML pages, or Puppeteer/Playwright for JavaScript-rendered sites. Fetch the page, parse the HTML with CSS selectors, and extract the data you need. See the step-by-step tutorial in this guide for a working example using Books to Scrape.

Is JavaScript good for web scraping?

Yes. Node.js has a mature scraping toolchain (Cheerio, Puppeteer, Playwright, Browserbeam), native async/await for concurrent requests, and the advantage of running the same language as the pages you're scraping. It's a strong choice for any team already working in JavaScript or TypeScript.

What is the best Node.js web scraping library?

It depends on your use case. Axios + Cheerio is best for static HTML. Puppeteer or Playwright is best for JavaScript-rendered pages. Browserbeam is best for production scraping without browser management. See the comparison table and decision framework in this guide.

Can I scrape JavaScript-rendered pages with Node.js?

Yes. Use Puppeteer, Playwright, or Browserbeam. All three execute JavaScript and give you access to the fully rendered DOM. Cheerio and Axios alone cannot scrape JavaScript-rendered content because they don't execute client-side scripts.

How do I build a web crawler in Node.js?

Start with a seed URL, scrape the page for data and links, add new links to a queue, and repeat. Use a Set to track visited URLs and avoid cycles. Add rate limiting and a maximum depth to prevent runaway crawling. The pagination examples in this guide show the basic pattern.
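
A minimal sketch of that pattern with Axios and Cheerio, restricted to one site, using a fixed page budget instead of a depth limit and omitting rate limiting for brevity:

import axios from "axios";
import * as cheerio from "cheerio";

const seed = "https://books.toscrape.com/";
const queue = [seed];
const visited = new Set();
const MAX_PAGES = 20;

while (queue.length && visited.size < MAX_PAGES) {
  const url = queue.shift();
  if (visited.has(url)) continue;
  visited.add(url);

  const { data } = await axios.get(url);
  const $ = cheerio.load(data);

  // ...extract whatever data you need from $ here...

  // Queue same-site links we haven't seen yet
  $("a[href]").each((_, el) => {
    const next = new URL($(el).attr("href"), url).href;
    if (next.startsWith(seed) && !visited.has(next)) queue.push(next);
  });
}

console.log(`Crawled ${visited.size} pages`);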

What is the difference between Cheerio and Puppeteer?

Cheerio parses static HTML strings with a jQuery API. It's fast and lightweight but doesn't execute JavaScript. Puppeteer controls a real Chromium browser that renders pages fully, including JavaScript. Use Cheerio for static sites and Puppeteer for dynamic ones.

Can I use TypeScript for web scraping?

Yes. All major Node.js scraping libraries (Cheerio, Puppeteer, Playwright, Browserbeam SDK) ship with TypeScript declarations. TypeScript adds type safety to your scraped data shapes, catches selector errors at compile time, and improves code maintainability for complex scraping projects.

Is web scraping with JavaScript faster than Python?

For HTTP-based scraping (Axios + Cheerio vs Requests + BeautifulSoup), performance is comparable. Node.js has an edge in concurrent scraping because its event loop handles thousands of simultaneous connections without threading overhead. For browser-based scraping (Puppeteer vs Selenium), performance depends more on the browser engine than the language. The real bottleneck is always the network and the target server's response time.

Conclusion

You've now got the full Node.js scraping toolkit. Axios + Cheerio for static pages. Puppeteer and Playwright for JavaScript-heavy sites. Browserbeam for production scraping without infrastructure headaches. Plus pagination, proxies, retries, and concurrent scraping patterns.

The best way to learn is to pick one of the three projects above and customize it. Change the target site. Adjust the selectors. Add error handling. Break things and fix them.

If you want to skip the browser management and get structured data from any web page with a single API call, try Browserbeam for free. The Node.js SDK installs in seconds and the API documentation covers every endpoint.

For more scraping guides, check out our structured web scraping guide and web scraping in 2026 overview. If you're scaling to thousands of pages, see scaling web automation.

What will you scrape first?
