Web scraping in Node.js with Axios and Cheerio

Axios and Cheerio are a powerful combination for web scraping in Node.js, suitable for both experienced and beginner scrapers.

Introduction and requirements

Web scraping (also known as web data extraction or data scraping) is the automated process of extracting data from the web in a comprehensible and structured format.

To perform web scraping, we need to use an HTTP client, like Axios, to send requests to the target website and retrieve information such as the website's HTML code.

Next, we feed the obtained code to an HTML parser, in this case, Cheerio, which will help us select specific elements in the code and extract their data.

Our goal in this tutorial is to build a Hacker News scraper using the Axios and Cheerio Node.js libraries to extract the rank, link, title, author, and points from each article displayed on the first page of the website.

Using Axios, Cheerio and Node.js to scrape data from Hacker News will enable us to get the rank, link, title, author and points from each article on the first page

Requirements

Initial setup

First, let's create a new directory hacker-news-scraper to house our scraper, then move into it and create a new file named main.js. We can either do it manually or straight from the terminal by using the following commands:

mkdir hacker-news-scraper

cd hacker-news-scraper

touch main.js

Still in the terminal, let's initialize our Node.js project and install Axios and Cheerio. Finally, we can open our project in our code editor of choice. Since I'm using VS Code, I can type the command code . to open the current directory in VS Code.

npm init -y

npm install axios cheerio

code .

Right after we open our project, we can expect to see a node_modules folder along with the main.js, package-lock.json, and package.json files.

Starting files

Next, let's add "type": "module" to our package.json file. This will give us access to import declarations and top-level await, which means we can use the await keyword outside of async functions.

Since we are already in the package.json file, let's also add a script to run our scraper with the npm start command. To do that, we just have to add the entry "start": "node main.js" to the existing "scripts" object.

File in VS Code showing how to add code to start scraper

And now, we are ready to move to the next step and start writing some code in our main.js file.

How to make an HTTP GET request with Axios

In the main.js file, we will use Axios to make a GET request to our target website, save the page's HTML to a variable named html, and log it to the console.

Code

import axios from "axios";

const response = await axios.get("https://news.ycombinator.com/");
const html = response.data;
console.log(html);

Output

And here is the result we expect to see after running the npm start command:

website HTML code

Great! Now that we are properly targeting the page's HTML code, it's time to use Cheerio to parse the code and extract the specific data we want.
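
One thing the snippet above does not cover is a failed request. By default, axios.get rejects when the server answers with a status outside the 2xx range or when no response arrives at all. The sketch below shows one way to turn such an error into a readable message. The helper name describeRequestError is hypothetical, and the commented usage is only a suggestion for how it could be wired into main.js.

```javascript
// Turn an error thrown by axios.get() into a short, readable message.
// The three branches follow the error shapes described in the Axios docs.
function describeRequestError(error) {
    if (error.response) {
        // The server answered, but with a status outside the 2xx range
        return `Server responded with status ${error.response.status}`;
    }
    if (error.request) {
        // The request was sent, but no response came back
        return "No response received (network problem or timeout)";
    }
    // Something went wrong before the request was sent
    return `Request could not be sent: ${error.message}`;
}

// Possible usage in main.js (hypothetical, not executed here):
// try {
//     const response = await axios.get("https://news.ycombinator.com/");
//     console.log(response.data);
// } catch (error) {
//     console.error(describeRequestError(error));
// }
```

Keeping the classification in a small function means the scraper can log a clear reason for a failure instead of printing the whole error object.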

Parsing the data with Cheerio

Next, let's use Cheerio to parse the HTML data and scrape the contents from all the articles on the first page of Hacker News.

import axios from "axios";
import * as cheerio from "cheerio";

const response = await axios.get("https://news.ycombinator.com/");
const html = response.data;

// Use Cheerio to parse the HTML
const $ = cheerio.load(html);

Now that Cheerio is loading and parsing the HTML, we can use the variable $ to select elements on the page.

But before we select an element, let's use the developer tools to inspect the page and find what selectors we need to use to target the data we want to extract.

Developer tools analyzing Hacker News web page

When analyzing the website's structure, we can see that each article's rank and title sit inside a table row with the class athing.

So, let's use Cheerio to select all elements with the athing class and save them to a variable named articles.

Next, to verify we have successfully selected the correct elements, let's loop through each article and log its text contents to the console.

Code

import axios from "axios";
import * as cheerio from "cheerio";

const response = await axios.get("https://news.ycombinator.com/");
const html = response.data;

// Use Cheerio to parse the HTML
const $ = cheerio.load(html);

// Select all the elements with the class name "athing"
const articles = $(".athing");

// Loop through the selected elements
for (const article of articles) {
    const text = $(article).text().trim();
    // Log each article's text content to the console
    console.log(text);
}

Output


1.      Hyundai Head Unit Hacking (xakcop.com)
2.      The Art of Knowing When to Quit (jim-nielsen.com)
3.      Tailscale bug allowed a person to share nodes from other tailnets without auth (tailscale.com)
4.      Show HN: Plus -- Self-updating screenshots (plusdocs.com)
5.      Ruby 3.2's YJIT is Production-Ready (shopify.engineering)
6.      In the past, I've had students call my problem sets "emotionally trying" (twitter.com/shengwuli)
7.      EV batteries alone could satisfy short-term grid storage demand as early as 2030 (nature.com)
8.      Ask HN: Has anyone worked at the US National Labs before?
9.      Let's build GPT: from scratch, in code, spelled out by Andrej Karpathy [video] (youtube.com)
10.      Is a venture studio right for you? (steveblank.com)
11.      How Do AIs' Political Opinions Change as They Get Smarter and Better-Trained? (astralcodexten.substack.com)
12.      I Am the Ghost Here (guernicamag.com)
13.      ChrysaLisp (github.com/vygr)
14.      Git security vulnerabilities announced (github.blog)
15.      Show HN: A tool for motion-capturing 3D characters using a VR headset (diegomacario.github.io)
16.      A flurry of new studies identifies causes of the Industrial Revolution (economist.com)
17.      My grandfather was almost shot down at the White House (2018) (nones-leonard.medium.com)
18.      Common Lisp and Music Composition (ldbeth.sdf.org)
19.      Cultivating Depth and Stillness in Research (andymatuschak.org)
20.             Patterns (YC S21) is hiring (patterns.app)
21.      A Not-So-Brief History of the United States Navy Steel Band (panonthenet.com)
22.      Glitching a microcontroller to unlock the bootloader (grazfather.github.io)
23.      The Metapict Blog -- TikZ like figures using Racket (soegaard.github.io)
24.      Show HN: Stack-chan -- Open-source companion robot easy to assemble and customize (github.com/meganetaaan)
25.      A new scan to detect and cure the commonest cause of high blood pressure (qmul.ac.uk)
26.      Learning Physics With Ringworld (2010) (tor.com)
27.      What's going on in the world of extensions (blog.mozilla.org)
28.      UT-Austin blocks access to TikTok on campus Wi-Fi networks (texastribune.org)
29.      The Amagasaki Derailment [video] (youtube.com)
30.      We could stumble into AI catastrophe (effectivealtruism.org)

Great! We've managed to access each element's rank and title. However, we are still missing the article's URL, points, and author.

In the next step, we will use Cheerio's find method to grab the missing values and organize the obtained data in a JavaScript object.

The Cheerio find method

The find method gets the descendants of each element in the current set of matched elements, filtered by a selector.

In the context of our scraper, we can use find to select specific descendants of each article element.

Returning to the Hacker News website, we can find the selectors we need to extract our target data.

Element selectors illustrating the CSS selectors used on the Hacker News website

Code

Here's what our code looks like now:

import axios from "axios";
import * as cheerio from "cheerio";

const response = await axios.get("https://news.ycombinator.com/");
const html = response.data;

// Use Cheerio to parse the HTML
const $ = cheerio.load(html);

// Select all the elements with the class name "athing"
const articles = $(".athing");

// Loop through the selected elements
for (const article of articles) {
    // Organize the extracted data in an object
    const structuredData = {
        url: $(article).find(".titleline a").attr("href"),
        rank: $(article).find(".rank").text().replace(".", ""),
        title: $(article).find(".titleline").text(),
        author: $(article).find("+tr .hnuser").text(),
        points: $(article).find("+tr .score").text().replace(" points", ""),
    };

    // Log each element's structured data to the console
    console.log(structuredData);
}

Output

And after running node main.js we can expect the following output:


{
  url: 'https://cookieplmonster.github.io/2023/01/15/remastering-colin-mcrae-rally-3-silentpatch/',
  rank: '1',
  title: 'Remastering Colin McRae Rally 3 with SilentPatch (cookieplmonster.github.io)',
  author: 'breakingcups',
  points: '246'
}
{
  url: 'https://www.geoffreylitt.com/2023/01/08/for-your-next-side-project-make-a-browser-extension.html',
  rank: '2',
  title: 'For your next side project, make a browser extension (geoffreylitt.com)',
  author: 'Glench',
  points: '130'
}
{
  url: 'https://www.fosslife.org/awk-power-and-promise-40-year-old-language',
  rank: '3',
  title: 'Awk: Power and Promise of a 40 yr old language (2021) (fosslife.org)',
  author: 'sargstuff',
  points: '58'
}
{
  url: 'https://jackevansevo.github.io/revisiting-kde.html',
  rank: '4',
  title: 'Revisiting KDE (jackevansevo.github.io)',
  author: 'rc00',
  points: '190'
}
{
  url: 'https://community.stadia.com/t5/Stadia-General/A-Gift-from-the-Stadia-Team-amp-Bluetooth-Controller/m-p/85936#M34875',
  rank: '5',
  title: 'Google announces update to unlock Stadia controllers to work with other devices (stadia.com)',
  author: 'anderspitman',
  points: '290'
}
{
  url: 'https://furbo.org/2023/01/15/the-shit-show/',
  rank: '6',
  title: 'The Shit Show (furbo.org)',
  author: 'chazeon',
  points: '393'
}
{
  url: 'https://chriswarrick.com/blog/2023/01/15/how-to-improve-python-packaging/',
  rank: '7',
  title: 'How to improve Python packaging (chriswarrick.com)',
  author: 'Kwpolska',
  points: '157'
}
{
  url: 'https://www.instructables.com/DIY-Raspberry-Orange-Pi-NAS-That-Really-Looks-Like/',
  rank: '8',
  title: 'DIY Raspberry / Orange Pi NAS That Looks Like a NAS -- 2023 Edition (instructables.com)',
  author: 'axiomdata316',
  points: '91'
}
{
  url: 'https://viewfromthewing.com/what-we-know-now-about-friday-nights-near-disaster-at-jfk-airport/',
  rank: '9',
  title: "What we know now about Friday night's near-disaster at JFK airport (viewfromthewing.com)",
  author: 'bgc',
  points: '86'
}
{
  url: 'https://nliu.net/posts/2021-03-19-interview.html',
  rank: '10',
  title: 'Subverting the software interview (2021) (nliu.net)',
  author: 'g0xA52A2A',
  points: '137'
}
{
  url: 'https://chriskiehl.com/article/practical-lenses',
  rank: '11',
  title: 'Making Lenses Practical in Java (chriskiehl.com)',
  author: 'goostavos',
  points: '36'
}
{
  url: 'https://www.construct.net/en/blogs/ashleys-blog-2/rts-devlog-beat-lag-1607',
  rank: '12',
  title: 'How to beat lag when developing a multiplayer RTS game (construct.net)',
  author: 'AshleysBrain',
  points: '51'
}
{
  url: 'https://arxiv.org/abs/2201.12601',
  rank: '13',
  title: 'A formula for the nth digit of 𝜋 and 𝜋^n (arxiv.org)',
  author: 'georgehill',
  points: '206'
}
{
  url: 'https://www.infoq.com/articles/architecture-skeptics-guide/',
  rank: '14',
  title: "A skeptic's guide to software architecture decisions (infoq.com)",
  author: 'valand',
  points: '17'
}
{
  url: 'https://blog.alexewerlof.com/p/tech-debt-day',
  rank: '15',
  title: "We invested 10% to pay back tech debt; Here's what happened (alexewerlof.com)",
  author: 'hanifbbz',
  points: '9'
}
{
  url: 'https://www.neuralframes.com',
  rank: '16',
  title: 'Show HN: Create your own video clips with Stable Diffusion (neuralframes.com)',
  author: 'nicollegah',
  points: '166'
}
{
  url: 'https://maritime.org/tour/seashadow/index.php',
  rank: '17',
  title: 'Virtual Tour of the Hughes Mining Barge and Sea Shadow (maritime.org)',
  author: 'walrus01',
  points: '8'
}
{
  url: 'https://www.theregister.com/2023/01/14/in_brief_security/',
  rank: '18',
  title: 'NSA asks Congress to let it get on with that warrantless data harvesting, again (theregister.com)',
  author: 'LinuxBender',
  points: '164'
}
{
  url: 'https://skio.com/careers/',
  rank: '19',
  title: 'Skio (YC S20) Is Hiring (skio.com)',
  author: '',
  points: ''
}
{
  url: 'item?id=34392783',
  rank: '20',
  title: 'Tell HN: Repurposing old iPads as home security cameras',
  author: 'evo_9',
  points: '73'
}
{
  url: 'item?id=34388866',
  rank: '21',
  title: 'Ask HN: How do you trust that your personal machine is not compromised?',
  author: 'coderatlarge',
  points: '400'
}
{
  url: 'https://blog.revolutionanalytics.com/2014/01/the-fourier-transform-explained-in-one-sentence.html',
  rank: '22',
  title: 'The Fourier Transform, explained in one sentence (2014) (revolutionanalytics.com)',
  author: 'signa11',
  points: '397'
}
{
  url: 'https://github.com/furkanonder/beetrace',
  rank: '23',
  title: 'Trace your Python process line by line with minimal overhead (github.com/furkanonder)',
  author: 'fywvzqhvnn',
  points: '35'
}
{
  url: 'item?id=34393273',
  rank: '24',
  title: 'Tell HN: Windows 10 might have tricked you into using a online account',
  author: 'xchip',
  points: '71'
}
{
  url: 'https://github.com/Enerccio/SLT',
  rank: '25',
  title: 'SLT -- A Common Lisp Language Plugin for Jetbrains IDE Lineup (github.com/enerccio)',
  author: 'gjvc',
  points: '117'
}
{
  url: 'https://www.fastcompany.com/90270226/the-origins-of-silicon-valleys-garage-myth',
  rank: '26',
  title: "The Origins of Silicon Valley's Garage Myth that (2018) (fastcompany.com)",
  author: '2-718-281-828',
  points: '21'
}
{
  url: 'https://goodereader.com/blog/kindle/amazon-is-no-longer-allowing-downloading-kindle-unlimited-titles-via-usb',
  rank: '27',
  title: 'Amazon is no longer allowing downloading Kindle Unlimited titles via USB (goodereader.com)',
  author: 'dodgermax',
  points: '130'
}
{
  url: 'https://gitlab.com/tsoding/porth',
  rank: '28',
  title: "Porth, It's Like Forth but in Python (gitlab.com/tsoding)",
  author: 'Alifatisk',
  points: '91'
}
{
  url: 'https://goldensyrupgames.com/blog/2023-01-14-gobgp-windows/',
  rank: '29',
  title: 'BGP on Windows Desktop (goldensyrupgames.com)',
  author: 'GSGBen',
  points: '22'
}
{
  url: 'https://useadrenaline.com',
  rank: '30',
  title: 'Show HN: AI-powered code correction that teaches you along the way (useadrenaline.com)',
  author: 'jshobrook',
  points: '65'
}

Congratulations! We've just scraped information from all the articles displayed on the first page of Hacker News using Axios and Cheerio.
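
The output above has a few rough edges: every value is a string, job posts such as rank 19 have an empty author and points, and Ask HN or Tell HN entries carry a relative link like item?id=34392783. The following sketch shows one possible way to tidy up each record before storing it. The helpers normalizeArticle and toNumberOrNull are hypothetical names introduced here, and using null for missing values is an assumption about what downstream code would prefer.

```javascript
// Base URL used to resolve relative links such as "item?id=34392783"
const BASE_URL = "https://news.ycombinator.com/";

// Convert a scraped string to a number, or null when it is empty or not numeric
function toNumberOrNull(text) {
    const value = Number.parseInt(text, 10);
    return Number.isNaN(value) ? null : value;
}

// Hypothetical helper: clean up one record produced by the scraper
function normalizeArticle(raw) {
    return {
        // new URL() leaves absolute links untouched and resolves relative ones
        url: raw.url ? new URL(raw.url, BASE_URL).href : null,
        rank: toNumberOrNull(raw.rank),
        title: raw.title.trim(),
        // Job posts have no author or score, so store null instead of ""
        author: raw.author || null,
        points: toNumberOrNull(raw.points),
    };
}

const example = normalizeArticle({
    url: "item?id=34392783",
    rank: "20",
    title: "Tell HN: Repurposing old iPads as home security cameras",
    author: "evo_9",
    points: "73",
});
console.log(example);
```

Inside the scraper loop, this could be applied as normalizeArticle(structuredData) before logging. Because parseInt reads only the leading digits, it would also cope with a score that reads "1 point", which the replace(" points", "") call leaves unchanged.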

In theory, we've accomplished our goal. However, there are still challenges that we might come across when scraping the web, and getting blocked is one of the most common issues web scrapers face.

Avoid being blocked with Axios

Hacker News is a simple website without any aggressive anti-bot protections in place, so we were able to scrape it without running into any major blocking issues.

Complex websites might employ different techniques to detect and block bots, such as analyzing the data encoded in HTTP requests received by the server, fingerprinting, CAPTCHAs, and more.

Avoiding all types of blocking can be a very challenging task, and its difficulty varies according to your target website and the scale of your scraping activities.

Nevertheless, there are some simple techniques, like passing the correct User-Agent header, that can already help our scrapers pass basic website verifications.

What is the User-Agent header?

The User-Agent header informs the server about the operating system, vendor, and version of the requesting client. This is relevant because any inconsistencies in the information the website receives can alert it about suspicious bot-like activity, leading to our scrapers getting blocked.

One of the ways we can avoid this is by passing custom headers to the HTTP request we made earlier using Axios, thus ensuring that the User-Agent used matches the one from the machine sending the request.

You can check your own User-Agent by accessing the http://whatsmyuseragent.org/ website. For example, this is my computer's User-Agent:

User-Agent example

With this information, we can now pass the User-Agent header to our Axios HTTP request.

How to use the User-Agent header in Axios

In order to verify that Axios is indeed sending the specified headers, let's create a new file named headers-test.js and send a request to the website https://httpbin.org/.

To send custom headers using Axios, we pass a request config object (named params here) containing a headers property as the second argument to axios.get:

import axios from "axios";

const params = {
    headers: {
        "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    },
};

const response = await axios.get("https://httpbin.org/headers", params);
console.log(response.data);

After running the node headers-test.js command, we can expect to see our request headers printed to the console:

Custom User-Agent header

As we can verify by checking the User-Agent, Axios used the custom headers we passed as a parameter to the request.

In contrast, this is what the User-Agent for the same request would look like if we didn't pass any custom headers:

Axios User-Agent header

Cool, now that we know how to properly pass custom headers to an Axios HTTP request, we can implement the same logic in our Hacker News scraper.

Required headers, cookies, and tokens

Setting the proper User-Agent header helps avoid basic blocking, but it is not enough to overcome the more sophisticated anti-bot systems present in modern websites.

There are many other types of information, such as additional headers, cookies, and access tokens, that we might be required to send with our request in order to get to the data we want. If you want to know more about the topic, check out the Dealing with headers, cookies, and tokens section of the Apify Web Scraping Academy.

An alternative to Axios

Despite Axios being a solid choice for scraping, it was not primarily designed for the needs of modern web scraping, and because of that, it requires extra setup to ensure that its requests are not easily blocked.

Got-scraping, on the other hand, is an open-source HTTP client maintained by Apify, which was made for scraping. Its purpose is to send browser-like requests out of the box, helping our scrapers blend in with the website traffic.

To demonstrate that, let's first add Got-scraping to our project by running the following command:

npm install got-scraping

Now, let's go back to our headers-test.js file and modify the code to use Got-scraping instead of Axios.

import { gotScraping } from "got-scraping";

const response = await gotScraping.get("https://httpbin.org/headers");
console.log(response.body);

Next, run the command node headers-test.js to see the headers that Got-scraping automatically added to the request.

auto generated headers

Note that Got-scraping included the correct User-Agent without us having to pass any additional parameters to the request like we did for Axios.

Not only that, but it also included additional headers that will help our requests look more "human-like" and not be blocked by the target website.

Final code

Using Axios:

import axios from "axios";
import * as cheerio from "cheerio";

const params = {
    headers: {
        "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    },
};

const response = await axios.get("https://news.ycombinator.com/", params);
const html = response.data;

// Use Cheerio to parse the HTML
const $ = cheerio.load(html);

// Select all the elements with the class name "athing"
const articles = $(".athing");

// Loop through the selected elements
for (const article of articles) {
    // Organize the extracted data in an object
    const structuredData = {
        url: $(article).find(".titleline a").attr("href"),
        rank: $(article).find(".rank").text().replace(".", ""),
        title: $(article).find(".titleline").text(),
        author: $(article).find("+tr .hnuser").text(),
        points: $(article).find("+tr .score").text().replace(" points", ""),
    };

    // Log each element's structured data to the console
    console.log(structuredData);
}

Using Got-scraping:

import { gotScraping } from "got-scraping";
import * as cheerio from "cheerio";

const response = await gotScraping.get("https://news.ycombinator.com/");
const html = response.body;

// Use Cheerio to parse the HTML
const $ = cheerio.load(html);

// Select all the elements with the class name "athing"
const articles = $(".athing");

// Loop through the selected elements
for (const article of articles) {
    // Organize the extracted data in an object
    const structuredData = {
        url: $(article).find(".titleline a").attr("href"),
        rank: $(article).find(".rank").text().replace(".", ""),
        title: $(article).find(".titleline").text(),
        author: $(article).find("+tr .hnuser").text(),
        points: $(article).find("+tr .score").text().replace(" points", ""),
    };

    // Log each element's structured data to the console
    console.log(structuredData);
}