Web Scraping with Java Explained in Simpler Terms
Web scraping allows you to gather large amounts of data from the internet quickly and efficiently. It is particularly useful when websites do not expose their data in a structured way through application programming interfaces (APIs).
For instance, imagine you’re creating an application that compares the prices of items across e-commerce sites. How would you go about this? One way is to manually check the price of items yourself across all the sites and record your findings. However, this is not practical, as there are thousands of products on e-commerce platforms, and it would take you forever to extract the relevant data manually.
A better way to do this is through web scraping. Web scraping is the process of automatically extracting data from web pages and websites through the use of software.
Software scripts, referred to as web scrapers, are used to access websites and retrieve data from them. The data retrieved, usually in an unstructured form, can then be analyzed and stored in a structured way that is meaningful to users.
Web Scraping in Data Extraction
Web scraping is very valuable in data extraction: it provides access to a wealth of data and allows for automation, so you can schedule your web scraping script to run at certain times or in response to certain triggers, as the sketch below shows. Web scraping also allows you to get real-time updates and makes it easy to conduct market research.
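For instance, a scraping task can be scheduled with nothing more than the JDK’s built-in ScheduledExecutorService. Here is a minimal sketch; the task body is a placeholder, and the six-hour interval is an arbitrary example:

import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScrapeScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run the (placeholder) scraping task immediately, then repeat every 6 hours.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("Running scraper at " + Instant.now()),
                0, 6, TimeUnit.HOURS);
    }
}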

A lot of businesses and companies rely on web scraping to extract data for analysis. Companies specializing in human resources, e-commerce, finance, real estate, travel, social media, and research use web scraping to extract relevant data from websites.
Google itself uses web scraping to index websites on the internet so that it can provide relevant search results to users.
However, it is important to exercise caution when web scraping. Although scraping publicly accessible data is not illegal, some websites don’t allow scraping. This could be because they hold sensitive user information, their terms of service explicitly forbid web scraping, or they are protecting intellectual property.
Additionally, some websites don’t allow web scraping as it can overload the website’s server and lead to increased bandwidth costs, especially when web scraping is done at scale.
To check if a website can be scraped, append /robots.txt to the website’s URL. The robots.txt file indicates to bots which parts of the website can be crawled or scraped. For instance, to check whether you can scrape Google, go to google.com/robots.txt.

User-agent: * applies the rules that follow to all bots, software scripts, and crawlers. Disallow tells bots that they can’t access any URL under a given path, for instance /search, while Allow indicates paths they are permitted to access.
An example of a site that does not allow scraping is LinkedIn. To check if you can scrape LinkedIn, go to linkedin.com/robots.txt.

As you can see, you are not allowed to scrape LinkedIn without their permission. Always check whether a website allows scraping to avoid any legal issues; you can even do this check from Java, as the sketch below shows.
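If you want to review a site’s rules programmatically rather than in the browser, you can fetch the robots.txt file from Java. Below is a minimal sketch using the standard java.net.http.HttpClient available since Java 11; the target URL is just an example:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsTxtChecker {
    public static void main(String[] args) throws Exception {
        // Example target; replace with the site you intend to scrape.
        String robotsUrl = "https://www.scrapethissite.com/robots.txt";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(robotsUrl)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Print the raw rules so you can review the Allow/Disallow entries before scraping.
        System.out.println(response.body());
    }
}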
Why Java Is a Suitable Language for Web Scraping
While you can create a web scraper in a variety of programming languages, Java is particularly well suited to the job for a number of reasons. First, Java has a rich ecosystem and a large community, and it provides a variety of web scraping libraries, such as jsoup, WebMagic, and HtmlUnit, which make it easy to write web scrapers.

It also provides HTML parsing libraries that simplify the process of extracting data from HTML documents, and networking libraries such as HttpURLConnection for making requests to different website URLs.
Java’s strong support for concurrency and multithreading is also beneficial in web scraping, as it allows you to process multiple requests in parallel and scrape multiple pages simultaneously, as the sketch below illustrates. With scalability being a key strength of Java, you can comfortably scrape websites at a massive scale using a web scraper written in Java.
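As an illustration, here is a minimal sketch that fetches several pages in parallel using the standard java.net.http.HttpClient and CompletableFuture (Java 11+); the URLs are examples only:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelFetcher {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        // Example URLs only; substitute pages you are allowed to scrape.
        List<String> urls = List.of(
                "https://www.scrapethissite.com/pages/simple/",
                "https://www.scrapethissite.com/pages/forms/");
        // Send all requests asynchronously; each future completes independently.
        List<CompletableFuture<Void>> futures = urls.stream()
                .map(url -> client.sendAsync(
                                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                                HttpResponse.BodyHandlers.ofString())
                        .thenAccept(response -> System.out.println(
                                url + " -> " + response.body().length() + " characters")))
                .collect(Collectors.toList());
        // Block until every request has finished.
        futures.forEach(CompletableFuture::join);
    }
}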
Java’s cross-platform support also comes in handy, as it allows you to write a web scraper and run it on any system that has a compatible Java Virtual Machine. You can therefore write a web scraper on one operating system or device and run it on a different operating system without modifying the scraper.
Java can also be used with headless browsers such as Headless Chrome, HtmlUnit, Headless Firefox, and PhantomJS, among others. A headless browser is a browser without a graphical user interface. Headless browsers can simulate user interactions and are very useful when scraping websites that require user interaction; a sketch using HtmlUnit follows below.
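As a rough sketch of what this looks like, the following loads a page with HtmlUnit and prints its text content. It assumes the HtmlUnit dependency is on the classpath and uses the org.htmlunit package names from HtmlUnit 3.x (older 2.x releases used com.gargoylesoftware.htmlunit instead):

import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

public class HeadlessExample {
    public static void main(String[] args) throws Exception {
        // WebClient is HtmlUnit's headless browser; try-with-resources closes it for us.
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true); // execute the page's JavaScript
            webClient.getOptions().setCssEnabled(false);       // skip CSS processing for speed
            HtmlPage page = webClient.getPage("https://www.scrapethissite.com/pages/simple/");
            System.out.println(page.asNormalizedText());
        }
    }
}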
To cap it all, Java is a very popular and widely used language that is well supported and integrates easily with a variety of tools, such as databases and data processing frameworks. This is beneficial because it ensures that the tools you will need for scraping, processing, and storing data will likely support Java.
Let us see how we can use Java for web scraping.
Java for Web Scraping: Prerequisites
To use Java in web scraping, the following prerequisites should be fulfilled:
1. Java – you should have Java installed, preferably the latest long-term support (LTS) version. In case you don’t have Java installed, refer to the official Java installation guide to learn how to install it on your machine.
2. Integrated Development Environment (IDE) – you should have an IDE installed on your machine. In this tutorial, we will use IntelliJ IDEA, but you can use any IDE you are familiar with.
3. Maven – this will be used for dependency management and to install a web scraping library.
In case you don’t have Maven installed, you can install it on Debian/Ubuntu-based systems by opening the terminal and executing:
sudo apt install maven
This installs Maven from your distribution’s package repository. You can confirm Maven was successfully installed by executing:
mvn -version
If the installation was successful, you should see output showing the installed Maven version, the Java version it uses, and details about your operating system:

Setting Up the Environment
To set up your environment:
1. Open IntelliJ IDEA. On the left menu bar, click on Projects, then select New Project.

2. In the New Project window that opens, fill it in as shown below. Make sure the Language is set to Java and the Build System to Maven. You can give the project any name you prefer, then use Location to specify the folder where you want the project created. Once done, click on Create.

3. Once your project is created, you should have a pom.xml in your project as shown below.

The pom.xml file is created by Maven and contains information about the project and configuration details used by Maven to build the project. It is this file that we also use to indicate that we will be using external libraries.
In building a web scraper, we will be using the jsoup library. We, therefore, need to add it as a dependency in the pom.xml file so that Maven can make it available in our project.
4. Add the jsoup dependency by copying the code below into your pom.xml file:
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.16.1</version>
    </dependency>
</dependencies>
The result should be as shown below:

In case you encounter an error saying the dependency cannot be found, click the Maven reload icon that appears so Maven can load the changes, download the dependency, and clear the error.
With that, your environment is fully set up.
Web Scraping With Java
For web scraping, we are going to scrape data from ScrapeThisSite, which provides a sandbox where developers can practice web scraping without running into legal issues.
To scrape a website using Java:
1. On the left-hand menu bar in IntelliJ, expand the src directory, then the main directory inside it. The main directory contains a directory called java; right-click on it and select New, then Java Class.

Give the class any name that you like, such as WebScraper, then press Enter to create the new Java class.

Open the newly created file containing the Java class you just created.
2. Web scraping involves getting data from websites. Therefore, we need to specify the URL we want to scrape data from. Once we specify the URL, we connect to it and make a GET request to fetch the HTML content of the page.
The code that does this is shown below:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class WebScraper {
    public static void main(String[] args) {
        String url = "https://www.scrapethissite.com/pages/simple/";
        try {
            Document doc = Jsoup.connect(url).get();
            System.out.println(doc);
        } catch (IOException e) {
            System.out.println("An IOException occurred. Please try again.");
        }
    }
}
Output:

As you can see, the HTML of the page is returned, and that is what we print out. When scraping, the URL you specify may contain an error, or the resource you are trying to scrape may not exist at all; in such cases, the request throws an IOException. That is why it is important to wrap the code in a try-catch statement.
The line:
Document doc = Jsoup.connect(url).get();
is used to connect to the URL you want to scrape. The get() method makes a GET request and fetches the HTML of the page. The returned result is stored in a jsoup Document object named doc. Storing the result in a Document allows you to use the jsoup API to manipulate the returned HTML.
3. Go to ScrapeThisSite and inspect the page. In the HTML, you should see the structure shown below:

Notice that all the countries on the page share a similar structure. Each country sits in a div with a class of country, which contains an h3 element with a class of country-name holding the country’s name.
Inside this main div, there is another div with a class of country-info, which contains information such as the capital, population, and area of the country. We can use these class names to select the HTML elements and extract information from them. A simplified sketch of this markup follows below.
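To make the structure concrete, here is a simplified sketch of the markup for a single country, based on the description above (attributes and extra elements present on the real page are omitted):

<div class="country">
    <h3 class="country-name">Andorra</h3>
    <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
        <strong>Population:</strong> <span class="country-population">84000</span><br>
        <strong>Area (km2):</strong> <span class="country-area">468.0</span><br>
    </div>
</div>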
4. Extract specific content from the HTML on the page using the following lines:
Elements countries = doc.select(".country");
for (Element country : countries) {
    String countryName = country.select(".country-name").text();
    String capitalCity = country.select(".country-capital").text();
    String population = country.select(".country-population").text();
    System.out.println(countryName + " - " + capitalCity + " - Population - " + population);
}
We are using the method select() to select elements from the HTML of the page that match the specific CSS selector we pass in. In our case, we pass in the class names. From inspecting the page, we saw that all the country information on the page is stored under a div with a class of country.
Each country has its own div with a class of country, and that div contains information such as the country name, capital city, and population.
Therefore, we first select all the countries on the page using the class .country. We store this in a variable called countries of type Elements, which works just like a list. We then use a for-loop to go through countries, extract the country name, capital city, and population, and print out what is found. Note that select() accepts most of the standard CSS selector syntax, not just bare class names, as the sketch below shows.
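For example, assuming the doc object from the earlier snippet, the same elements can be targeted in a few equivalent ways:

// More specific selector: tag name plus class, with a descendant combinator.
Elements names = doc.select("div.country h3.country-name");

// selectFirst() returns only the first match (or null if nothing matches).
Element firstCountry = doc.selectFirst(".country");
if (firstCountry != null) {
    System.out.println(firstCountry.select(".country-name").text());
}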
Our entire codebase is shown below:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class WebScraper {
    public static void main(String[] args) {
        String url = "https://www.scrapethissite.com/pages/simple/";
        try {
            Document doc = Jsoup.connect(url).get();
            Elements countries = doc.select(".country");
            for (Element country : countries) {
                String countryName = country.select(".country-name").text();
                String capitalCity = country.select(".country-capital").text();
                String population = country.select(".country-population").text();
                System.out.println(countryName + " - " + capitalCity + " - Population - " + population);
            }
        } catch (IOException e) {
            System.out.println("An IOException occurred. Please try again.");
        }
    }
}
Output:

With the information we get back from the page, we can do a variety of things, such as print it out as we just did or store it in a file for further data processing, as the sketch below shows.
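For example, here is a sketch of a variant that writes the scraped values to a CSV file using java.nio. The file name countries.csv is arbitrary, and the naive comma-joining works here only because none of the scraped values contain commas:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class WebScraperToCsv {
    public static void main(String[] args) {
        String url = "https://www.scrapethissite.com/pages/simple/";
        List<String> rows = new ArrayList<>();
        rows.add("country,capital,population"); // CSV header row
        try {
            Document doc = Jsoup.connect(url).get();
            for (Element country : doc.select(".country")) {
                rows.add(country.select(".country-name").text() + ","
                        + country.select(".country-capital").text() + ","
                        + country.select(".country-population").text());
            }
            // Write all rows to countries.csv in the working directory.
            Files.write(Path.of("countries.csv"), rows);
            System.out.println("Wrote " + (rows.size() - 1) + " countries to countries.csv");
        } catch (IOException e) {
            System.out.println("An IOException occurred. Please try again.");
        }
    }
}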
Conclusion
Web scraping is an excellent way to extract unstructured data from websites, store the data in a structured way, and process the data to extract meaningful information. However, it is important to exercise caution when web scraping, as certain websites don’t allow web scraping.
To be on the safe side, use websites that provide sandboxes to practice scraping. Otherwise, always inspect the robots.txt of every website you want to scrape to find out if the website allows scraping.
When writing a web scraper, Java is an excellent language, as it provides libraries that make web scraping easier and more efficient. As a Java developer, building a web scraper will help you develop your programming skills even further. So go ahead and write your own web scraper, or modify the one used in this article to extract different kinds of information. Happy coding!
You may also explore some popular Cloud-based web scraping solutions.