Raddy Website Design & Development Tutorials | RaddyDev

Build a Simple Web Scraper using Node.js, Fetch and Cheerio

By Raddy in NodeJs

In this tutorial, we are going to scrape the list of 2022 Formula 1 drivers from the official Formula 1 website using Node.js, node-fetch and Cheerio. I chose node-fetch and Cheerio because many people will already be familiar with their syntax, and both are very easy to use and understand.

Cheerio implements a subset of core jQuery, which makes selecting elements extremely easy, and if you have worked with Fetch on the front end, node-fetch will feel very familiar. Fetch is also built into Node.js itself from version 18 onwards, so on recent versions you may not need to install it as a separate package. That’s always a plus.

Let’s learn more about web scraping…

What is web scraping?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

https://en.wikipedia.org/wiki/Web_scraping

Where is web scraping used?

Web Scraping can be used for pretty much everything from E-Commerce, Data Science, Job Boards, Marketing and Sales, Finance, Data Journalism and so on.

To give you some examples, you can build apps such as a News Aggregator, Job Search Portal, Specialised Search Engine, Competitor Analysis Tool, Best Price Finder and so much more!

Is web scraping illegal?

Web Scraping isn’t illegal by itself, but the problem arises when people disregard websites’ terms of service and scrape without permission.

Basically DON’T copy data that is copyrighted.

It is legal if you scrape data from websites for public consumption and use it for analysis. However, it is not legal if you scrape confidential data for profit.

Create a Simple Web Scraper

Before you get started, make sure that you have Node.js installed. We’ll be scraping the drivers page of the official Formula 1 website, which you can view at https://www.formula1.com/en/drivers.html.

Let’s create our web scraper.

To do that, create a new project folder called “Formula1” (or whatever you wish) and then run the following command in Command Line (Mac / Linux) or Powershell (Windows).

npm init

This will initialise a new project for you and ask a few questions about it. The most important one is the package name; for the rest you can just keep pressing Enter until the setup is finished.

You can skip all questions by adding the “-y” flag like in the example below:

npm init -y

At this point, you should see a file called package.json in your project folder.

Dependencies Installation

We have to install a few dependencies:

[x] cheerio
[x] node-fetch (a built-in fetch ships with Node.js 18 and later)

Open the Command Line / Terminal / Powershell and install the dependencies listed above just like so:

npm install cheerio node-fetch

Your package.json file should look similar to this:

{
  "name": "f1",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "cheerio": "^1.0.0-rc.10",
    "node-fetch": "^3.2.3"
  }
}

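The examples in this tutorial use import instead of require, so we need to add one more line, "type": "module", to package.json. This enables ES modules (the ECMAScript 6 import syntax) in our app.js. If you would rather use require, skip this step, but note that node-fetch version 3 is published as an ES module only, so you would need node-fetch version 2 in that case.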

{
  "name": "f1",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "cheerio": "^1.0.0-rc.10",
    "node-fetch": "^3.2.3"
  }
}

* Note that the version of the dependencies might change in the future.

Project Structure

Our project is going to be simple. We just need our main application file, which we will call “app.js”.

📂 node_modules
🌍 app.js
📜 package-lock.json
📜 package.json

Creating our application – app.js

1) Import NPM Packages

Create the app.js file inside the main directory just like in the project structure model above. Let’s start by importing the libraries that we installed earlier in this tutorial so they can be used.

// NPM packages that we installed
import * as cheerio from 'cheerio';
import fetch from 'node-fetch';
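
If you are on a recent Node.js release, the node-fetch import is optional because a global fetch is available from version 18. The sketch below is one possible way to prefer the built-in version and only fall back to the package when needed. The helper name resolveFetch is hypothetical and not part of either library.

```javascript
// Hypothetical helper: prefer the built-in fetch (Node.js 18+),
// otherwise fall back to the node-fetch package installed earlier.
async function resolveFetch() {
  if (typeof globalThis.fetch === 'function') {
    return globalThis.fetch;
  }
  // Dynamic import works for ES-module-only packages such as node-fetch v3
  const mod = await import('node-fetch');
  return mod.default;
}

resolveFetch().then((fetchImpl) => {
  console.log(`fetch is available: ${typeof fetchImpl === 'function'}`);
});
```

The rest of this tutorial keeps the explicit node-fetch import, so nothing below depends on this helper.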

2) Fetch data from URL

In order to fetch data from a specific URL, let’s create an asynchronous function called “getFormulaOneDrivers”. You can also use an IIFE if you like. An IIFE (Immediately Invoked Function Expression) is a JavaScript function that runs as soon as it is defined.
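
For reference, here is a minimal sketch of what the async IIFE variant could look like. The loadDrivers function is a hypothetical stand-in that returns placeholder values, so the snippet runs without a network connection; in the real scraper its place would be taken by the fetch and parse logic.

```javascript
// Hypothetical stand-in for the fetch-and-parse logic built in this tutorial
async function loadDrivers() {
  return ['driver-a', 'driver-b'];
}

// Async IIFE: the function is defined and invoked in one expression,
// so there is no separate call at the bottom of the file.
const finished = (async () => {
  try {
    const drivers = await loadDrivers();
    console.log(`Loaded ${drivers.length} drivers`);
    return drivers;
  } catch (error) {
    console.log(error);
    return [];
  }
})();
```

A named function is easier to reuse and test, which is why the rest of the tutorial sticks with getFormulaOneDrivers.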

Since we are using async/await, we can wrap everything inside a try-catch statement, like in the example below.

// Importing the NPM packages that we installed
import * as cheerio from 'cheerio';
import fetch from 'node-fetch';

// Function starts here
async function getFormulaOneDrivers() {
  try {
    // Fetch data from URL and store the response into a const
    const response = await fetch('https://www.formula1.com/en/drivers.html');
    // Convert the response into text
    const body = await response.text();
    
  } catch (error) {
    console.log(error);
  }
}

// Run getFormulaOneDrivers
getFormulaOneDrivers();

At this point, you should be able to fetch the URL and extract the source code from the page. Now we need to be able to select elements from that page so we can store them.
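
One thing worth knowing is that fetch does not throw on HTTP errors such as 404 or 500; it only rejects on network failures. The sketch below shows one possible way to guard against that by checking response.ok before reading the body. The helper name fetchHtml and its fetchImpl parameter are assumptions made for illustration, not part of node-fetch.

```javascript
// Hypothetical helper: fetch a page and fail loudly on a non-2xx status.
// fetchImpl defaults to the global fetch (Node.js 18+); pass the fetch
// imported from node-fetch instead if you are using the package.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  // response.ok is true only for status codes in the 200-299 range
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}
```

Inside getFormulaOneDrivers you could then write const body = await fetchHtml('https://www.formula1.com/en/drivers.html', fetch); and the existing try-catch would log any failed request.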

3) Load data to Cheerio

To work with the data that we just got back from the URL, we need to load it into Cheerio. For this we can use cheerio.load() and pass the body data to it.

// Importing the NPM packages that we installed
import * as cheerio from 'cheerio';
import fetch from 'node-fetch';

// Function starts here
async function getFormulaOneDrivers() {
  try {
    // Fetch data from URL and store the response into a const
    const response = await fetch('https://www.formula1.com/en/drivers.html');
    // Convert the response into text
    const body = await response.text();
    
    // Load body data
    const $ = cheerio.load(body);
    
  } catch (error) {
    console.log(error);
  }
}

// Run getFormulaOneDrivers
getFormulaOneDrivers();

4) Select data using Cheerio

Now we can start thinking about the data that we want to scrape. In this case, I want to be able to scrape the Rank, Points, First Name, Last Name, Team and Photo of the driver.

[Image: Formula 1]

Open the page that we are trying to scrape (https://www.formula1.com/en/drivers.html) and inspect it by right-clicking and selecting Inspect. This should open the Inspect tool inside your browser, which will help us navigate through the DOM of the website and select the elements that we need.

[Image: inspect tool]

When you inspect the page, you will have to dig around to find the container that holds each driver. Essentially, we want to iterate through the list of col-12 elements and get the data from each one. Sometimes this is tricky to do, but in our case the data is fairly well structured and we can use the class names “listing-items--wrapper”, “row” and “col-12” to select them.

Let’s jump back into the code editor and do that:

// Importing the NPM packages that we installed
import * as cheerio from 'cheerio';
import fetch from 'node-fetch';

// Function starts here
async function getFormulaOneDrivers() {
  try {
    // Fetch data from URL and store the response into a const
    const response = await fetch('https://www.formula1.com/en/drivers.html');
    // Convert the response into text
    const body = await response.text();
    
    // Load body data
    const $ = cheerio.load(body);
    
    // Selecting Each col-12 class name and iterate through the list
    $('.listing-items--wrapper > .row > .col-12').map((i, el) => {
      
    });
      
    
  } catch (error) {
    console.log(error);
  }
}

// Run getFormulaOneDrivers
getFormulaOneDrivers();

Now we can start digging deeper and select other elements such as the Rank. Inspect the page one more time, hover over the Rank and see if you can find a class name that we can use.

[Image: rank]

As you can see in the photo above, we are fairly lucky that the rank div has a class name of “rank” that can be used to select. Note that we only want to select the text so we’ll use the Cheerio text method to do that.

// Importing the NPM packages that we installed
import * as cheerio from 'cheerio';
import fetch from 'node-fetch';

// Function starts here
async function getFormulaOneDrivers() {
  try {
    // Fetch data from URL and store the response into a const
    const response = await fetch('https://www.formula1.com/en/drivers.html');
    // Convert the response into text
    const body = await response.text();
    
    // Load body data
    const $ = cheerio.load(body);
    
    // Selecting Each col-12 class name and iterate through the list
    $('.listing-items--wrapper > .row > .col-12').map((i, el) => {
      
      // Select the rank class name and use the text method to only grab the content
      const rank = $(el).find('.rank').text();
      
      console.log(rank);
    });
      
    
  } catch (error) {
    console.log(error);
  }
}

// Run getFormulaOneDrivers
getFormulaOneDrivers();

You can now run the application and see if you get the Rank of the drivers in the console. To do that, use Node.js to run the “app.js” file.

node app.js

Run the command above and, hopefully, you should get the same result as mine, shown below:

[Image: cheerio log data]

That’s pretty much it. I am going to add a few more selectors and create an empty array that we can push the data into. That array can then be used for whatever you wish. In the video, I used it to save the data on my local machine. You can try to save it as an Excel file if you wish.

// Importing the NPM packages that we installed
import * as cheerio from 'cheerio';
import fetch from 'node-fetch';

// Function starts here
async function getFormulaOneDrivers() {
  try {
    // Fetch data from URL and store the response into a const
    const response = await fetch('https://www.formula1.com/en/drivers.html');
    // Convert the response into text
    const body = await response.text();
    
    // Load body data
    const $ = cheerio.load(body);
    
    // Create empty array
    const items = [];
    
    // Selecting Each col-12 class name and iterate through the list
    $('.listing-items--wrapper > .row > .col-12').map((i, el) => {
      
      // Select rank, points, first name, last name, team and photo
      const rank = $(el).find('.rank').text();
      const points = $(el).find('.points > .f1-wide--s').text();
      const firstName = $(el).find('.listing-item--name span:first').text();
      const lastName = $(el).find('.listing-item--name span:last').text();
      const team = $(el).find('.listing-item--team').text();
      const photo = $(el).find('.listing-item--photo img').attr('data-src');

      // Push the data into the items array
      items.push({
        rank,
        points,
        firstName,
        lastName,
        team,
        photo
      });
      
    });
      
    console.log(items);
    
  } catch (error) {
    console.log(error);
  }
}

// Run getFormulaOneDrivers
getFormulaOneDrivers();

Your data should look similar to this:

[
  {
    "rank": "1",
    "points": "71",
    "firstName": "Charles",
    "lastName": "Leclerc",
    "team": "Ferrari",
    "photo": "https://www.formula1.com/content/dam/fom-website/drivers/C/CHALEC01_Charles_Leclerc/chalec01.png.transform/2col/image.png"
  },
  {
    "rank": "2",
    "points": "37",
    "firstName": "George",
    "lastName": "Russell",
    "team": "Mercedes",
    "photo": "https://www.formula1.com/content/dam/fom-website/drivers/G/GEORUS01_George_Russell/georus01.png.transform/2col/image.png"
  },
  
  .....
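
Notice that every value comes back as a string. As a rough sketch of a possible next step, the snippet below converts rank and points to numbers and builds a CSV string that a spreadsheet application such as Excel can open. The sample records reuse the two drivers shown above (without the photo field, for brevity), and the helper names cleanDriver, csvCell and toCsv are hypothetical.

```javascript
// Sample records in the shape produced by the scraper above
const items = [
  { rank: '1', points: '71', firstName: 'Charles', lastName: 'Leclerc', team: 'Ferrari' },
  { rank: '2', points: '37', firstName: 'George', lastName: 'Russell', team: 'Mercedes' },
];

// Convert numeric strings to numbers and trim stray whitespace
function cleanDriver(item) {
  return {
    rank: Number(item.rank),
    points: Number(item.points),
    name: `${item.firstName.trim()} ${item.lastName.trim()}`,
    team: item.team.trim(),
  };
}

// Quote a value for CSV, doubling any embedded double quotes
function csvCell(value) {
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Build a CSV string with a header row taken from the first record's keys
function toCsv(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const lines = rows.map((row) => headers.map((h) => csvCell(row[h])).join(','));
  return [headers.join(','), ...lines].join('\n');
}

const csv = toCsv(items.map(cleanDriver));
console.log(csv);
```

The resulting string can be written to disk with writeFile from node:fs/promises, for example to a file called drivers.csv.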

That’s all!

I hope that the tutorial was clear and informative. Sometimes it is easier to explain and show things in video format. If you have ANY suggestions on how to make the tutorial a little bit easier to understand, please do let me know. And if you found this tutorial useful, let me know in the comments below. This way I will know if I should be making more.

Thank you for reading this tutorial. Share it with friends and family.

