How to build your own LinkedIn Profile Scraper

LinkedIn is full of useful data: from high-profile leads and skilled job candidates to extensive job listings and business opportunities.

All this information can be accessed by hand, as it's made publicly available to users and non-users alike. But what if we want to access this data on a larger scale?

Today, we want to show you how you can harness the power of web scraping to pull data from LinkedIn job listings.

Prerequisites

You will need the following to understand and build along:

  • An IDE installed (for this, we’ll be using VS Code)

  • A good internet connection

  • Basic Knowledge of JavaScript

  • Node.js installed (you can download it from nodejs.org)

  • To verify that the installation went well, run node -v and npm -v, as shown below
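
Both commands print a version number if Node.js and npm were installed correctly; the exact versions on your machine will differ from the examples here:

node -v   # e.g. v18.17.0
npm -v    # e.g. 9.6.7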

Checking for web scraping permissions

The first thing to consider when you want to scrape a website is whether it grants permission for scraping and which actions aren’t permitted. Appending robots.txt to the site’s root URL, like so:

https://www.linkedin.com/robots.txt should give the result below:

[Image: LinkedIn’s robots.txt file]

As seen in the image above, LinkedIn does not permit crawling its site, though exceptions might be considered after sending a reasonable and convincing request by email.
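
If you'd rather check the file programmatically, here's a minimal sketch using Axios (which you'll install later in this tutorial) to fetch and print it:

const axios = require('axios');

// Fetch LinkedIn's robots.txt and print its contents to the terminal
axios
  .get('https://www.linkedin.com/robots.txt')
  .then((response) => console.log(response.data))
  .catch(console.error);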

Scraping LinkedIn Job Postings with JavaScript

Although scraping publicly available data is generally considered legal, LinkedIn itself clearly doesn’t want to be scraped, so you want to be respectful when building your bot.

One thing you’ll avoid in this project is using a headless browser to log in to an account and access what would be considered private data. Instead, you’ll focus on scraping public LinkedIn data that doesn’t require you to bypass any login screen.

You’ll go to LinkedIn’s public job listings page and use Axios and Cheerio to download and parse the HTML, extracting the job title, company, location, and URL of each listing.

Creating the project

Create a new folder for this project and initialize it from your terminal:

mkdir LinkedinScraper
cd LinkedinScraper
npm init

The npm init command above initializes the project and creates a package.json file where the packages you install will be tracked. You will get a few prompts about the information you want the file to contain; press Enter to accept the defaults. Take note of the entry point created: index.js.

Installing packages

Express is a backend framework for Node.js. You will install it so your app can listen on a port (the one you set for your server) and you can check that everything works. Go ahead and run:

npm i express

The command above installs the express dependency for your project.

Cheerio helps parse markup; it is used to pick out HTML elements from a webpage and provides an API for traversing and manipulating the resulting data structure.

Run:

npm install cheerio

This will install the Cheerio dependency in the package.json file.

Axios is used to make HTTP requests. Run the command below to install the dependency.

npm install axios

This will also install the axios dependency in the package.json file.

Open the package.json file to see the installed packages.

[Image: package.json showing the installed dependencies]

The dependencies field contains the packages you have installed and their versions.
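
For reference, the dependencies field should look roughly like this; the exact version numbers will differ depending on when you install:

"dependencies": {
  "axios": "^1.6.0",
  "cheerio": "^1.0.0-rc.12",
  "express": "^4.18.0"
}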

Getting Started

To get the ball rolling, create an index.js file inside the folder and import your dependencies at the top:

const axios = require('axios');
const cheerio = require('cheerio');

Next, initialize Express so that it listens on the port you want to use. Let’s say you decide on port 5000; the startup log message lets you know whether or not the server is running.

Edit the index.js file to look like this:

const PORT = 5000;
const axios = require('axios');
const cheerio = require('cheerio');
const express = require('express');

const app = express();
app.listen(PORT, () => console.log(`server running on port ${PORT}`));

To check if your server is running on the assigned port, run:

npm run start

The display on the terminal should look like this:

[Image: terminal showing “server running on port 5000”]

Awesome! It works.

Note: You don’t have to rerun npm run start every time you change your script; if you use nodemon, it reloads the server whenever you save your changes.
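
One assumption worth spelling out: npm run start only works if your package.json defines a start script. A minimal sketch of the scripts field, using nodemon (installed with npm install --save-dev nodemon) for auto-reloading; plain "node index.js" works too if you'd rather skip nodemon:

"scripts": {
  "start": "nodemon index.js"
}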

Now focusing on the actual scraping, get the URL of the website you want to scrape.

Axios takes this URL, makes an HTTP request, and then returns response data. This response data can be displayed in the terminal.

Add this to your index.js:

const url = "https://coinmarketcap.com/"; // an example URL to test the request
axios.get(url).then((response) => {
  const html_data = response.data;
  const $ = cheerio.load(html_data);
  console.log(html_data); // display the response data in the terminal
});

From the code above, you will notice that the response from the HTTP request is assigned to the variable html_data.

Understanding Cheerio

In the code snippet above, you loaded the HTML into Cheerio using the .load() method and stored the result in the $ variable, similar to jQuery. With the elements loaded, you can retrieve DOM elements based on the data you need.

Cheerio makes it possible to navigate through the DOM elements and manipulate them by targeting tags, classes, ids, and hrefs. For example, an element with a class of submitButton can be selected with $('.submitButton'), an id with $('#submitButton'), and an h1 element with $('h1'). Cheerio also provides methods like find() to locate elements, each() to iterate through elements, and filter(), amongst others.
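
As a quick, self-contained illustration (the markup here is invented for the example):

const cheerio = require('cheerio');

// Load a small HTML snippet and pull data out of it
const $ = cheerio.load('<ul><li class="job"><h3>Email Developer</h3></li></ul>');
$('.job').each((index, element) => {
  console.log($(element).find('h3').text()); // "Email Developer"
});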

Parsing the HTML with Cheerio

Before parsing an HTML page, you must first inspect the structure of the page. In this case, you want to pick out each job title, company, location, and URL.

To pick the right selectors, you can go back to the opened URL and take note of the attributes you can use. You’ll notice that all the data needed sits inside individual <li> elements, so target those first and store them in a constant.

const jobs = $('li');

With all the listings stored inside jobs, you can now go one by one and extract the specific bits of data you’re looking for. You can test your selectors right inside the browser DevTools to avoid sending unnecessary requests to the server, as shown below.
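
For example, pasting something like this into the DevTools console shows how many elements a selector matches (the class names here are the ones used later in this tutorial):

// Run in the browser console on the job listings page
document.querySelectorAll('h3.base-search-card__title').length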

Scraping the LinkedIn Job Listings

There are many ways to navigate to the next page, but with your current knowledge of the website, the easiest way is to increase the start parameter in the URL by 25 to display the next 25 jobs until there are no more results; a for loop meets this need perfectly.

const axios = require("axios");
const cheerio = require("cheerio");

const linkedinJobs = [];

for (let pageNumber = 0; pageNumber < 1000; pageNumber += 25) {
  // Pass the current pageNumber to the start parameter of the URL
  let url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=${pageNumber}`;
  axios(url)
    .then((response) => {
      const html = response.data;
      const $ = cheerio.load(html);
      const jobs = $("li");
      jobs.each((index, element) => {
        const jobTitle = $(element)
          .find("h3.base-search-card__title")
          .text()
          .trim();
        const company = $(element)
          .find("h4.base-search-card__subtitle")
          .text()
          .trim();
        const location = $(element)
          .find("span.job-search-card__location")
          .text()
          .trim();
        const link = $(element).find("a.base-card__full-link").attr("href");
        linkedinJobs.push({
          Title: jobTitle,
          Company: company,
          Location: location,
          Link: link,
        });
      });
    })
    .catch(console.error);
}

The above code snippet does the following:

  • The starting point will be 0, as that’s the first value you want to pass to the start parameter

  • Because you know that when hitting 1000, there won’t be any more results, you want the code to run as long as pageNumber is less than 1000

  • Finally, after every iteration of the code, you want to increase pageNumber by 25, effectively moving to the next page

  • The pageNumber variable is used as the value for the start parameter (see the caveat and sketch after this list)
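
One caveat: axios calls are asynchronous, so this loop fires all 40 requests almost simultaneously. To stay respectful of the server, as mentioned earlier, you could space the requests out. Here's a minimal sketch using async/await with a hypothetical sleep helper (not part of the original code):

const axios = require("axios");

// Hypothetical helper: resolves after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeAllPages() {
  for (let pageNumber = 0; pageNumber < 1000; pageNumber += 25) {
    // Build the URL with the current pageNumber, same as in the loop above
    const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=${pageNumber}`;
    const response = await axios.get(url); // wait for each request to finish
    // ...parse response.data with Cheerio exactly as shown above...
    await sleep(2000); // pause 2 seconds before requesting the next page
  }
}

scrapeAllPages().catch(console.error);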

Writing the LinkedIn Data to a CSV File

You’ll be using the Objects-to-CSV package. They have detailed documentation if you’d like to go deeper into the package. In simple terms, it will convert the array of JavaScript objects (linkedinJobs) into a CSV format you can save to your machine.

First, you'll install the package using npm install objects-to-csv and require it at the top of your project.

const ObjectsToCsv = require('objects-to-csv');

You can now use the package right after closing your jobs.each() method:

const csv = new ObjectsToCsv(linkedinJobs);
csv.toDisk('./linkedInJobs.csv', { append: true });

As the package’s documentation notes, “The keys in the first object of the array will be used as column names,” so it’s important that you make them descriptive when using the .push() method.

Also, because you want to loop through several pages, you don’t want your CSV to be overwritten every time but to add the new data below. To do so, all you need to do is set append to true. It will only add the headers once and keep updating the file with the new data.
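
Given the Title, Company, Location, and Link keys pushed above, the resulting CSV would look roughly like this (the values here are invented for illustration):

Title,Company,Location,Link
Email Developer,Acme Corp,"New York, NY",https://www.linkedin.com/jobs/view/...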

If you’ve followed along, here’s what your finished code should look like:

const axios = require("axios");
const cheerio = require("cheerio");
const ObjectsToCsv = require("objects-to-csv");

const linkedinJobs = [];

for (let pageNumber = 0; pageNumber < 1000; pageNumber += 25) {
  // Pass the current pageNumber to the start parameter of the URL
  let url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=${pageNumber}`;
  axios(url)
    .then((response) => {
      const html = response.data;
      const $ = cheerio.load(html);
      const jobs = $("li");
      jobs.each((index, element) => {
        const jobTitle = $(element)
          .find("h3.base-search-card__title")
          .text()
          .trim();
        const company = $(element)
          .find("h4.base-search-card__subtitle")
          .text()
          .trim();
        const location = $(element)
          .find("span.job-search-card__location")
          .text()
          .trim();
        const link = $(element).find("a.base-card__full-link").attr("href");
        linkedinJobs.push({
          Title: jobTitle,
          Company: company,
          Location: location,
          Link: link,
        });
      });
      // Append this page's jobs to the CSV file
      const csv = new ObjectsToCsv(linkedinJobs);
      csv.toDisk("./linkedInJobs.csv", { append: true });
    })
    .catch(console.error);
}

To run it, go to your terminal and type npm run start.

[Image: terminal output of the scraper run]

Conclusion

In this project, you learned how to scrape data from LinkedIn’s job listings page. You also became familiar with parsing and manipulating HTML elements with Cheerio. With this knowledge you can scrape any website of your choice, but remember that it is essential to check a site’s legal policies before scraping it.
