LinkedIn is full of useful data: high-profile leads, skilled job candidates, extensive job listings, and business opportunities.
All of this information can be accessed by hand, since it is publicly available to users and non-users alike. But what if you want to access this data at a larger scale?
Today, we want to show you how you can harness the power of web scraping to pull data from LinkedIn job listings.
Prerequisites
You will need the following to understand and build along:
An IDE installed (we'll be using VSCode)
A good internet connection
Basic knowledge of JavaScript
Node.js installed (download it from nodejs.org)
To verify that the installation went well, you can run:
node -v
npm -v
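If both commands print a version number, you're good to go. The exact versions will differ depending on what you installed; the output looks something like this:
$ node -v
v18.17.0
$ npm -v
9.6.7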
Checking for web scraping permissions
The first thing to consider when you want to scrape a website is whether it grants permission for scraping, and which actions aren't permitted. You can check this by appending /robots.txt to the site's root URL:
https://www.linkedin.com/robots.txt
In LinkedIn's case, the file states that crawling is not permitted, although LinkedIn may consider whitelisting you if you send a reasonable and convincing request to whitelist-crawl@linkedin.com.
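As a generic illustration only (not a copy of LinkedIn's actual file, which you should always check live at the URL above), a robots.txt that disallows general crawling looks like this:
# Illustrative example of a restrictive robots.txt
User-agent: *
Disallow: /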
Scraping LinkedIn Job Postings with JavaScript
Although scraping publicly available data is generally considered legal, LinkedIn clearly doesn't want to be scraped, so you want to be respectful when building your bot.
One thing you'll avoid in this project is using a headless browser to log into an account and access what would be considered private data. Instead, you'll focus on scraping public LinkedIn data that doesn't require you to bypass any login screen.
You'll go to LinkedIn's public job listings page and use Axios and Cheerio to download and parse the HTML, extracting the job title, company, location, and URL of each listing.
Creating the project
Create a new folder for the project and initialize it from your terminal:
mkdir LinkedinScraper
cd LinkedinScraper
npm init
The npm init command initializes the project and creates a package.json file where the packages you install will be recorded. You will get a few prompts about the information you want the file to contain; press Enter to accept the defaults. Take note of the entry point created: index.js.
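Accepting the defaults produces a minimal package.json roughly like the one below (the name and version fields will reflect your own answers to the prompts):
{
  "name": "linkedinscraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}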
Installing packages
Express is a backend framework for Node.js. You will install it so your script can listen on a port, i.e. the port you set for your server, which gives you an easy way to check that everything works. Go ahead and run:
npm i express
The command above installs the express dependency for your project.
Cheerio helps to parse markup; you'll use it to pick out HTML elements from a webpage. It provides an API that lets you traverse and manipulate the resulting data structure.
Run:
npm install cheerio
This will add the Cheerio dependency to the package.json file.
Axios is used to make HTTP requests. Run the command below to install the dependency.
npm install axios
This will also add the axios dependency to the package.json file.
Open the package.json file to see the installed packages. The dependencies field contains the packages you have installed and their versions.
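The exact version numbers depend on when you run the installs, but the dependencies field will look something like this:
"dependencies": {
  "axios": "^1.6.0",
  "cheerio": "^1.0.0-rc.12",
  "express": "^4.18.2"
}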
Getting Started
To get the ball rolling, create an index.js file inside the folder and import your dependencies at the top:
const axios = require('axios');
const cheerio = require('cheerio');
Next, initialize Express so that it listens on the port you want to use. Let's say you decide to use port 5000; logging a message on startup lets you know whether or not the server is running. Edit the index.js file to look like this:
const PORT = 5000;
const axios = require('axios');
const cheerio = require('cheerio');
const express = require('express');
const app = express();
app.listen(PORT, () => console.log(`server running on port ${PORT}`));
To check if your server is running on the assigned PORT, run:
npm run start
If everything is set up correctly, the terminal should print: server running on port 5000. Awesome! It works.
Note: You don't always have to rerun npm run start after every change to your script. If you install nodemon (npm install --save-dev nodemon) and use it to run your script, it takes care of reloading whenever you save your changes.
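Keep in mind that npm run start only works if a start script exists in package.json; npm init only generates a test script by default. Add something like the following to the scripts section (the dev entry assumes you installed nodemon as described above):
"scripts": {
  "start": "node index.js",
  "dev": "nodemon index.js"
}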
Now, focusing on the actual scraping, get the URL of the page you want to scrape.
Axios takes this URL, makes an HTTP request, and returns the response data, which you can then display in the terminal.
Add this to your index.js:
const url = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=0";
axios.get(url).then((response) => {
  const html_data = response.data;
  const $ = cheerio.load(html_data);
});
From the code above, you will notice that the response from the HTTP request is assigned to the variable html_data.
Understanding Cheerio
In the code snippet above, you loaded the HTML into Cheerio using the .load() method and stored the result in the $ variable, similar to jQuery. With the elements loaded, you can retrieve DOM elements based on the data you need.
Cheerio makes it possible to navigate through the DOM elements and manipulate them by targeting tags, classes, ids, and hrefs. For example, an element with a class of submitButton can be selected with $('.submitButton'), an id of submitButton with $('#submitButton'), and an h1 element with $('h1'). Cheerio also provides methods like find() to locate elements, each() to iterate over elements, and filter(), amongst others.
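As a quick, self-contained illustration of those selectors and methods (the HTML string and class names here are made up purely for the example):
const cheerio = require('cheerio');

// Hypothetical markup used only to demonstrate the selector syntax.
const $ = cheerio.load('<ul><li class="job"><h3>Email Developer</h3></li></ul>');

$('li.job').each((index, element) => {
  // .find() scopes the search to the current <li>; .text() extracts its text content.
  console.log($(element).find('h3').text()); // "Email Developer"
});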
Parsing the HTML with Cheerio
Before parsing an HTML page, you must first inspect the structure of the page. In this case, you want to pick out each listing's job title, company, location, and URL.
To pick the right selectors, go back to the opened URL and take note of the attributes you can use. You will notice that all the data you need sits inside individual <li> elements, so target those first and store them in a constant.
const jobs = $('li');
With all the listings stored inside jobs, you can now go one by one and extract the specific bits of data you're looking for. You can test your selectors right inside the browser's DevTools to avoid sending unnecessary requests to the server.
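For example, in the DevTools console you can test the same selectors with document.querySelectorAll before committing them to code. The class names below are the ones used later in this tutorial; if the page's markup differs, these calls may return null or an empty list:
// Run in the DevTools console on the jobs listing page:
document.querySelectorAll('li').length;                                    // how many listing cards match
document.querySelector('h3.base-search-card__title').textContent.trim();  // first job title
document.querySelector('a.base-card__full-link').href;                    // first listing URL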
Scraping Multiple Pages of LinkedIn Job Listings
There are many ways to navigate to the next page, but with your current knowledge of the website, the easiest is to increase the start parameter in the URL by 25 to display the next 25 jobs, repeating until there are no more results. A for loop fits this perfectly.
const axios = require("axios");
const cheerio = require("cheerio");

const linkedinJobs = [];

for (let pageNumber = 0; pageNumber < 1000; pageNumber += 25) {
  const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=${pageNumber}`;

  axios(url)
    .then((response) => {
      const html = response.data;
      const $ = cheerio.load(html);
      const jobs = $("li");

      jobs.each((index, element) => {
        const jobTitle = $(element)
          .find("h3.base-search-card__title")
          .text()
          .trim();
        const company = $(element)
          .find("h4.base-search-card__subtitle")
          .text()
          .trim();
        const location = $(element)
          .find("span.job-search-card__location")
          .text()
          .trim();
        const link = $(element).find("a.base-card__full-link").attr("href");

        linkedinJobs.push({
          Title: jobTitle,
          Company: company,
          Location: location,
          Link: link,
        });
      });
    })
    .catch(console.error);
}
The above code snippet does the following:
The starting point is 0, as that's the first value you want to pass to the start parameter.
Because you know that there won't be any more results once you hit 1000, the loop runs as long as pageNumber is less than 1000.
After every iteration, pageNumber increases by 25, effectively moving to the next page.
The pageNumber variable is used as the value for the start parameter in the URL, as sketched below.
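Concretely, the first few iterations produce URLs whose start parameter steps through the pages like this (only the changing part is shown; the other query parameters stay the same):
start=0   -> first 25 results
start=25  -> results 26 to 50
start=50  -> results 51 to 75
...
start=975 -> final iteration before pageNumber reaches 1000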
Writing the LinkedIn Data to a CSV File
You'll be using the Objects-to-CSV package. It has detailed documentation if you'd like to go deeper. In simple terms, it converts the array of JavaScript objects (linkedinJobs) into a CSV format you can save to your machine.
First, install the package with npm install objects-to-csv and require it at the top of your file:
const ObjectsToCsv = require('objects-to-csv');
You can now use the package right after closing your jobs.each() method:
const csv = new ObjectsToCsv(linkedinJobs);
csv.toDisk('./linkedInJobs.csv', { append: true });
“The keys in the first object of the array will be used as column names”, so it's important that you make them descriptive when using the .push() method.
Also, because you want to loop through several pages, you don’t want your CSV to be overwritten every time but to add the new data below. To do so, all you need to do is set append to true. It will only add the headers once and keep updating the file with the new data.
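Given the object keys used above, the first line written to linkedInJobs.csv is the header row, followed by one row per job. The values below are placeholders purely to show the shape of the file:
Title,Company,Location,Link
Email Developer,Example Corp,"New York, NY",https://www.linkedin.com/jobs/view/...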
If you’ve followed along, here’s what your finished code should look like:
const axios = require("axios");
const cheerio = require("cheerio");
const ObjectsToCsv = require("objects-to-csv");

const linkedinJobs = [];

for (let pageNumber = 0; pageNumber < 1000; pageNumber += 25) {
  const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=email%2Bdeveloper&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&currentJobId=2931031787&position=1&pageNum=0&start=${pageNumber}`;

  axios(url)
    .then((response) => {
      const html = response.data;
      const $ = cheerio.load(html);
      const jobs = $("li");

      jobs.each((index, element) => {
        const jobTitle = $(element)
          .find("h3.base-search-card__title")
          .text()
          .trim();
        const company = $(element)
          .find("h4.base-search-card__subtitle")
          .text()
          .trim();
        const location = $(element)
          .find("span.job-search-card__location")
          .text()
          .trim();
        const link = $(element).find("a.base-card__full-link").attr("href");

        linkedinJobs.push({
          Title: jobTitle,
          Company: company,
          Location: location,
          Link: link,
        });
      });

      // Append the jobs collected so far to the CSV file.
      const csv = new ObjectsToCsv(linkedinJobs);
      csv.toDisk("./linkedInJobs.csv", { append: true });
    })
    .catch(console.error);
}
To run it, go to your terminal and type npm run start
Conclusion
In this project, you have learned how to scrape data from LinkedIn's public job listings. You have also become familiar with parsing and manipulating HTML elements with Cheerio. With this knowledge you can scrape almost any website of your choice, but remember that it is essential to check a site's legal policies before scraping it.