Web Scraping in Java With Jsoup: A Step-By-Step Guide

Web scraping is the process of automatically extracting data and information from websites. It allows you to obtain structured data from web pages which can then be used for analysis, research, building datasets, and more.

Java, being a popular and versatile programming language, provides several libraries for web scraping. In this guide, we'll take an in-depth look at using the Jsoup library to build a web scraper from scratch. By the end, you'll have a working web scraper that can extract data from an entire website and output it neatly to a CSV file.

We'll cover:

  • What is Jsoup and why use it?
  • Setting up your Java project
  • Connecting to web pages
  • Inspecting the HTML to find what to scrape
  • Selecting elements using CSS selectors
  • Extracting text, attributes, and HTML from elements
  • Following pagination to scrape all pages
  • Exporting scraped data to CSV format
  • Tips to avoid getting blocked while scraping
  • Alternatives if you don't want to code the scraper yourself

Let's jump right in!

What is Jsoup?

Jsoup is an open-source Java library for extracting and manipulating data from HTML documents. It provides a convenient API for fetching URLs and parsing HTML into a traversable DOM (Document Object Model). You can then navigate and search the DOM using methods similar to jQuery to find the elements and data you're interested in.

Some standout features of Jsoup include:

  • Parse HTML from a URL, file, or string
  • Find and extract data using DOM traversal or CSS selectors
  • Manipulate the HTML elements, attributes, and text
  • Clean user-submitted content against a safe whitelist to prevent XSS attacks
  • Output tidy HTML

Jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup. It's an excellent choice for scraping, parsing, and cleaning websites.
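To get a feel for the API before we build anything, here's a minimal sketch that parses HTML from a plain string and extracts an element's text, no network connection needed (the class name and markup are just illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseExample {
    public static void main(String[] args) {
        // Parse an HTML string into a traversable Document
        Document doc = Jsoup.parse("<html><body><p class=\"greeting\">Hello, Jsoup!</p></body></html>");

        // Select the paragraph by its class and print its text content
        System.out.println(doc.select("p.greeting").text()); // prints: Hello, Jsoup!
    }
}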

Web Scraping Tutorial with Jsoup

Now let's get hands-on and build a web scraper using Jsoup! We'll scrape quotes from the website "Quotes to Scrape", a sandbox site set up specifically for testing web scrapers.

Here's what the target site looks like:
[Image of Quotes to Scrape homepage]

Our scraper will navigate through all the pages, extracting the quote text, author, and tags from each quote. Finally, it will save all the scraped data in a structured CSV file.

Setting Up

First, make sure you have the Java Development Kit (JDK) installed. We'll be using Java 8+.

Next, set up a new Java project in your favorite IDE. I'm using IntelliJ IDEA.

Jsoup is available via Maven, so add this dependency to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
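If your project uses Gradle instead, the equivalent dependency declaration is:

implementation 'org.jsoup:jsoup:1.13.1'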

Great, we're ready to start coding the scraper!

Connecting to the Target Page

Jsoup makes it very easy to fetch the HTML from a webpage. Just use the static Jsoup.connect() method and provide the URL:


Document doc = Jsoup.connect("https://quotes.toscrape.com/").get();

This sends an HTTP request to the specified URL (https://quotes.toscrape.com/ in this case), parses the returned HTML, and provides a Document object that we can work with.
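Note that get() performs network I/O and throws a checked IOException, so the call has to be wrapped or declared. A minimal sketch of handling it:

try {
    Document doc = Jsoup.connect("https://quotes.toscrape.com/").get();
    System.out.println(doc.title()); // e.g. "Quotes to Scrape"
} catch (IOException e) {            // java.io.IOException
    System.err.println("Failed to fetch the page: " + e.getMessage());
}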

Jsoup.connect supports various other options to control the request:

  • userAgent(String): Set the user-agent to send with requests
  • timeout(int): Set the request timeout in milliseconds
  • headers(Map<String, String>): Set custom header fields
  • cookies(Map<String, String>): Set cookies to send with the request

For example, some sites require a valid User-Agent header or they'll reject the request. To set it:


Document doc = Jsoup.connect("https://quotes.toscrape.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
        .get();
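These options can be chained on the same connection. For instance, a request with a timeout, a custom header, and a cookie might look like this (the specific values are illustrative, not required by the quotes site):

Document doc = Jsoup.connect("https://quotes.toscrape.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...")
        .timeout(10_000)                    // fail if no response within 10 seconds
        .header("Accept-Language", "en-US") // add a custom request header
        .cookie("theme", "dark")            // send an illustrative cookie
        .get();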

Analyzing the HTML Structure

In order to extract the desired data from the HTML page, we first need to inspect the page's structure to see how the data is laid out. Most modern web browsers provide built-in developer tools to make this easy.

Right-click on one of the quotes and select "Inspect" to open the developer tools. You should see something like this:

[Image of developer tools inspecting quote HTML]

Examining this, we can see that each quote is contained in a <div> element with the class "quote". Within these divs, the quote text is in a <span> with class "text", the author in a <small> with class "author", and the tags in <a> elements with class "tag".

Great, now that we know how the quotes are structured, let's select and extract them!

Selecting Elements with CSS Selectors

Jsoup provides convenient methods to find elements using CSS selectors, just like in JavaScript. If you're not familiar with CSS selectors, W3Schools has a great reference.

To select all elements with the class "quote":


Elements quotes = doc.select(".quote");

This returns an Elements collection with all the quote <div>s that we can loop through:


for (Element quote : quotes) {
    // do something with each quote element
}

To drill down further into each quote:


String quoteText = quote.select(".text").text();
String quoteAuthor = quote.select(".author").text();

Here we're selecting the quote and author elements within each quote div, then extracting their text content.
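text() is only one of several extraction methods. Depending on whether you want the rendered text, the inner markup, or an attribute value, you can use text(), html(), outerHtml(), or attr(). A quick illustration on the quote's text element:

Element textElement = quote.select(".text").first();

String rendered = textElement.text();        // the visible text, tags stripped
String inner = textElement.html();           // the HTML inside the element
String outer = textElement.outerHtml();      // the element including its own tag
String cssClass = textElement.attr("class"); // an attribute value, here "text"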

For the tags, there are multiple <a> elements with class "tag" per quote. We can select them all into an Elements collection again:


Elements quoteTags = quote.select(".tag");

Then loop through to get the individual tag strings:


List<String> tags = new ArrayList<>();

for (Element quoteTag : quoteTags) {
    tags.add(quoteTag.text());
}
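As a shorthand, recent Jsoup versions also provide Elements.eachText(), which collects the text of every matched element into a list in one call:

List<String> tags = quoteTags.eachText();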

At this point, we've successfully extracted all the data we want from a single page of quotes. But there are multiple pages! Let's use pagination to scrape them all.

Handling Pagination

The quotes site uses a typical pagination system with "Next" links at the bottom of each page. By following those links, we can scrape every page.

To select the Next link:


Element next = doc.select(".next a").first();

This finds the first <a> element inside an element with class "next". The next step is extracting the URL of the next page from its href attribute. But first we should check whether there actually is a Next link (on the last page there isn't):


if (next != null) {
    String nextUrl = next.attr("href");
} else {
    // we're on the last page
}

Putting it all together into a loop:


String quotesUrl = "
https://quotes.toscrape.com/";

while (true) {
// scrape quotes from current page
// ...

// check for next page  
Element next = doc.select(".next a").first();

if (next != null) {  
    // there is a next page
    String nextUrl = next.attr("href");
    quotesUrl = "https://quotes.toscrape.com" + nextUrl;
    doc = Jsoup.connect(quotesUrl).get();    
} else {
    // we‘re on the last page
    break;  
}

}

We start by fetching the base URL "https://quotes.toscrape.com/". After scraping the quotes on the current page, we look for the Next link. If it exists, we extract its href value, concatenate it onto the base URL (since it's a relative URL), and fetch the next page. Once we reach the last page, the loop exits.
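As an alternative to concatenating the base URL by hand, Jsoup can resolve relative links for you: because the document was fetched over the network, it knows its own base URL, and Element.absUrl("href") returns the link as an absolute URL. A sketch of the same step:

Element next = doc.select(".next a").first();

if (next != null) {
    // resolves e.g. "/page/2/" to "https://quotes.toscrape.com/page/2/"
    doc = Jsoup.connect(next.absUrl("href")).get();
}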

Now we're able to scrape all pages of quotes from the site!

Outputting Data to CSV

Finally, let's save our scraped data to a CSV file for easy use in other tools and analyses. We'll create a basic nested class to represent each Quote and store its data:


private static class Quote {
    String text;
    String author;
    List<String> tags;

    public Quote(String text, String author, List<String> tags) {
        this.text = text;
        this.author = author;
        this.tags = tags;
    }
}
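As an aside, if you're on Java 16 or newer, a record expresses the same data holder in a single line:

private record Quote(String text, String author, List<String> tags) {}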

Then after each quote is scraped, create a new Quote object and add it to a List:


List<Quote> allQuotes = new ArrayList<>();

// ...

Quote quote = new Quote(quoteText, quoteAuthor, tags);
allQuotes.add(quote);

Finally, to output the quote data to CSV:


try (PrintWriter pw = new PrintWriter("quotes.csv")) {
    pw.println("Quote|Author|Tags");

    for (Quote quote : allQuotes) {
        String tags = String.join(",", quote.tags);
        pw.println(quote.text + "|" + quote.author + "|" + tags);
    }
}

This writes each Quote's data to the file "quotes.csv", using a pipe "|" as the delimiter between fields (since the quote text itself may contain commas) and commas to separate the individual tags.
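To see how all the pieces fit together, here's the complete scraper assembled into one runnable class. It's the same code we developed above, presented as a sketch you can adapt:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class QuotesScraper {

    private static class Quote {
        String text;
        String author;
        List<String> tags;

        Quote(String text, String author, List<String> tags) {
            this.text = text;
            this.author = author;
            this.tags = tags;
        }
    }

    public static void main(String[] args) throws IOException {
        List<Quote> allQuotes = new ArrayList<>();
        Document doc = Jsoup.connect("https://quotes.toscrape.com/").get();

        while (true) {
            // extract the text, author, and tags of every quote on this page
            for (Element quote : doc.select(".quote")) {
                String text = quote.select(".text").text();
                String author = quote.select(".author").text();

                List<String> tags = new ArrayList<>();
                for (Element tag : quote.select(".tag")) {
                    tags.add(tag.text());
                }
                allQuotes.add(new Quote(text, author, tags));
            }

            // follow the "Next" link, or stop on the last page
            Element next = doc.select(".next a").first();
            if (next == null) {
                break;
            }
            doc = Jsoup.connect("https://quotes.toscrape.com" + next.attr("href")).get();
        }

        // write the results out as pipe-delimited CSV
        try (PrintWriter pw = new PrintWriter("quotes.csv")) {
            pw.println("Quote|Author|Tags");
            for (Quote q : allQuotes) {
                pw.println(q.text + "|" + q.author + "|" + String.join(",", q.tags));
            }
        }
    }
}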

And there we have it! A fully functioning web scraper built with Jsoup. Some additional things to consider:

Avoiding Getting Blocked When Web Scraping

Some websites don't like being scraped and will attempt to block requests they identify as coming from a scraper. Some techniques they use:

  • User-Agent detection: Blocking requests with blank, missing, or known bot User-Agents
  • Rate limiting: Blocking IPs that send too many requests too quickly
  • Honeypot traps: Hiding links that only bots can find in order to catch and block them

To avoid these, try to make your scraper behave more like a normal user:

  • Set a common browser User-Agent like we did above
  • Introduce random delays between requests to simulate reading time (see the sketch after this list)
  • Rotate proxy IPs and User-Agents when scraping a large number of pages from a single site
  • Respect robots.txt if present
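For instance, a random pause between page fetches takes only a couple of lines (the 1-3 second range here is an arbitrary choice):

// inside the scraping loop, before fetching the next page
try {
    // java.util.concurrent.ThreadLocalRandom; sleeps 1000-2999 ms
    Thread.sleep(ThreadLocalRandom.current().nextLong(1000, 3000));
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
}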

Alternatives to DIY Web Scraping

Depending on your specific needs, writing and running your own web scraper may not be necessary:

  • Many sites offer APIs providing their data in structured formats
  • Pre-made datasets are increasingly available covering common scraping needs
  • Third-party scraping services and tools handle the scraping and data cleaning for you

If you're looking for a large amount of web data without the hassle of scraping it yourself, it's worth checking if a pre-made dataset or API already exists. Bright Data, for example, offers datasets on many topics as well as customized data collection services.

Conclusion

Web scraping is an incredibly useful skill, allowing you to collect data otherwise locked within websites. With the Jsoup library, Java developers can scrape websites with just a few lines of code.

In this guide, we covered all the steps to building a complete web scraper in Java:

  • Setting up Jsoup in your project
  • Fetching web pages and parsing the HTML
  • Navigating the HTML document using CSS selectors to find the desired elements
  • Extracting text, attributes, and HTML from elements
  • Detecting and following pagination links to scrape all pages
  • Saving the scraped data into a structured format like CSV
  • Tips for avoiding anti-scraping countermeasures
  • Alternatives for cases where pre-made datasets or scraping services are preferable

You should now have a solid foundation for scraping websites using Java and Jsoup. Remember to always be respectful when scraping and abide by the rules in robots.txt. Happy scraping!

If you want to learn more, check out our other articles on web scraping, Java, and working with data. And feel free to leave any questions or thoughts in the comments below!
