Web Scraping in Java With Jsoup: A Step-By-Step Guide

Web scraping is the process of automatically extracting data and information from websites. It allows you to obtain structured data from web pages which can then be used for analysis, research, building datasets, and more.

Java, being a popular and versatile programming language, provides several libraries for web scraping. In this guide, we'll take an in-depth look at using the Jsoup library to build a web scraper from scratch. By the end, you'll have a working web scraper that can extract data from an entire website and output it neatly to a CSV file.

We'll cover:

  • What is Jsoup and why use it?
  • Setting up your Java project
  • Connecting to web pages
  • Inspecting the HTML to find what to scrape
  • Selecting elements using CSS selectors
  • Extracting text, attributes, and HTML from elements
  • Following pagination to scrape all pages
  • Exporting scraped data to CSV format
  • Tips to avoid getting blocked while scraping
  • Alternatives if you don't want to code the scraper yourself

Let's jump right in!

What is Jsoup?

Jsoup is an open-source Java library for extracting and manipulating data from HTML documents. It provides a convenient API for fetching URLs and parsing HTML into a traversable DOM (Document Object Model). You can then navigate and search the DOM using methods similar to jQuery to find the elements and data you're interested in.

Some standout features of Jsoup include:

  • Parse HTML from a URL, file, or string
  • Find and extract data using DOM traversal or CSS selectors
  • Manipulate the HTML elements, attributes, and text
  • Clean user-submitted content against a safe whitelist to prevent XSS attacks
  • Output tidy HTML

Jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup. It's an excellent choice for scraping, parsing, and cleaning websites.
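To get a feel for the API before we build anything, here's a minimal sketch that parses HTML from a plain string and extracts an element's text, no network connection needed (the class name and markup are just illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseExample {
    public static void main(String[] args) {
        // Parse an HTML string into a traversable Document
        Document doc = Jsoup.parse("<html><body><p class=\"greeting\">Hello, Jsoup!</p></body></html>");

        // Select the paragraph by its class and print its text content
        System.out.println(doc.select("p.greeting").text()); // prints: Hello, Jsoup!
    }
}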

Web Scraping Tutorial with Jsoup

Now let's get hands-on and build a web scraper using Jsoup! We'll scrape quotes from the website "Quotes to Scrape", a sandbox site set up specifically for testing web scrapers.

Here's what the target site looks like:
[Image of Quotes to Scrape homepage]

Our scraper will navigate through all the pages, extracting the quote text, author, and tags from each quote. Finally, it will save all the scraped data in a structured CSV file.

Setting Up

First, make sure you have the Java Development Kit (JDK) installed. We'll be using Java 8+.

Next, set up a new Java project in your favorite IDE. I'm using IntelliJ IDEA.

Jsoup is available via Maven, so add this dependency to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
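If your project uses Gradle instead, the equivalent dependency declaration is:

implementation 'org.jsoup:jsoup:1.13.1'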

Great, we're ready to start coding the scraper!

Connecting to the Target Page

Jsoup makes it very easy to fetch the HTML from a webpage. Just use the static Jsoup.connect() method and provide the URL:


Document doc = Jsoup.connect("https://quotes.toscrape.com/").get();

This sends an HTTP request to the specified URL (https://quotes.toscrape.com/ in this case), parses the returned HTML, and provides a Document object that we can work with.
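Note that get() performs network I/O and throws a checked IOException, so the call has to be wrapped or declared. A minimal sketch of handling it:

try {
    Document doc = Jsoup.connect("https://quotes.toscrape.com/").get();
    System.out.println(doc.title()); // e.g. "Quotes to Scrape"
} catch (IOException e) {            // java.io.IOException
    System.err.println("Failed to fetch the page: " + e.getMessage());
}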

Jsoup.connect supports various other options to control the request:

  • userAgent(String): Set the user-agent to send with requests
  • timeout(int): Set the request timeout in milliseconds
  • headers(Map<String, String>): Set custom header fields
  • cookies(Map<String, String>): Set cookies to send with the request

For example, some sites require a valid User-Agent header or they'll reject the request. To set it:


Document doc = Jsoup.connect("https://quotes.toscrape.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
        .get();
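These options can be chained on the same connection. For instance, a request with a timeout, a custom header, and a cookie might look like this (the specific values are illustrative, not required by the quotes site):

Document doc = Jsoup.connect("https://quotes.toscrape.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...")
        .timeout(10_000)                    // fail if no response within 10 seconds
        .header("Accept-Language", "en-US") // add a custom request header
        .cookie("theme", "dark")            // send an illustrative cookie
        .get();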

Analyzing the HTML Structure

In order to extract the desired data from the HTML page, we first need to inspect the page's structure to see how the data is laid out. Most modern web browsers provide built-in developer tools to make this easy.

Right-click on one of the quotes and select "Inspect" to open the developer tools. You should see something like this:

[Image of developer tools inspecting quote HTML]

Examining this, we can see that each quote is contained in a <div> element with the class "quote". Within these divs, the quote text is in a <span> with class "text", the author in a <small> with class "author", and the tags in <a> elements with class "tag".

Great, now that we know how the quotes are structured, let's select and extract them!

Selecting Elements with CSS Selectors

Jsoup provides convenient methods to find elements using CSS selectors, just like in JavaScript. If you're not familiar with CSS selectors, W3Schools has a great reference.

To select all elements with the class "quote":


Elements quotes = doc.select(".quote");

This returns an Elements collection with all the quote <div>s that we can loop through:


for (Element quote : quotes) {
    // do something with each quote element
}

To drill down further into each quote:


String quoteText = quote.select(".text").text();
String quoteAuthor = quote.select(".author").text();

Here we're selecting the quote and author elements within each quote div, then extracting their text content.
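text() is only one of several extraction methods. Depending on whether you want the rendered text, the inner markup, or an attribute value, you can use text(), html(), outerHtml(), or attr(). A quick illustration on the quote's text element:

Element textElement = quote.select(".text").first();

String rendered = textElement.text();        // the visible text, tags stripped
String inner = textElement.html();           // the HTML inside the element
String outer = textElement.outerHtml();      // the element including its own tag
String cssClass = textElement.attr("class"); // an attribute value, here "text"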

For the tags, there are multiple <a> elements with class "tag" per quote. We can select them all into an Elements collection again:


Elements quoteTags = quote.select(".tag");

Then loop through to get the individual tag strings:


List<String> tags = new ArrayList<>();

for (Element quoteTag : quoteTags) {
    tags.add(quoteTag.text());
}
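As a shorthand, recent Jsoup versions also provide Elements.eachText(), which collects the text of every matched element into a list in one call:

List<String> tags = quoteTags.eachText();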

At this point, we've successfully extracted all the data we want from a single page of quotes. But there are multiple pages! Let's use pagination to scrape them all.

Handling Pagination

The quotes site uses a typical pagination system with "Next" links at the bottom of each page. By following those links, we can scrape every page.

To select the Next link:


Element next = doc.select(".next a").first();

This finds the first <a> element inside an element with class "next". The next step is extracting the URL of the next page from its href attribute. But first we should check whether there actually is a Next link (on the last page there isn't):


if (next != null) {
    String nextUrl = next.attr("href");
} else {
    // we're on the last page
}

Putting it all together into a loop:


String quotesUrl = "
https://quotes.toscrape.com/";

while (true) {
// scrape quotes from current page
// ...

// check for next page  
Element next = doc.select(".next a").first();

if (next != null) {  
    // there is a next page
    String nextUrl = next.attr("href");
    quotesUrl = "https://quotes.toscrape.com" + nextUrl;
    doc = Jsoup.connect(quotesUrl).get();    
} else {
    // we‘re on the last page
    break;  
}

}

We start by fetching the base URL "https://quotes.toscrape.com/". After scraping the quotes on the current page, we look for the Next link. If it exists, we extract its href value, concatenate it onto the base URL (since it's a relative URL), and fetch the next page. Once we reach the last page, the loop exits.
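As an alternative to concatenating the base URL by hand, Jsoup can resolve relative links for you: because the document was fetched over the network, it knows its own base URL, and Element.absUrl("href") returns the link as an absolute URL. A sketch of the same step:

Element next = doc.select(".next a").first();

if (next != null) {
    // resolves e.g. "/page/2/" to "https://quotes.toscrape.com/page/2/"
    doc = Jsoup.connect(next.absUrl("href")).get();
}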

Now we're able to scrape all pages of quotes from the site!

Outputting Data to CSV

Finally, let's save our scraped data to a CSV file for easy use in other tools and analyses. We'll create a basic nested class to represent each Quote and store its data:


private static class Quote {
    String text;
    String author;
    List<String> tags;

    public Quote(String text, String author, List<String> tags) {
        this.text = text;
        this.author = author;
        this.tags = tags;
    }
}
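As an aside, if you're on Java 16 or newer, a record expresses the same data holder in a single line:

private record Quote(String text, String author, List<String> tags) {}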

Then after each quote is scraped, create a new Quote object and add it to a List:


List<Quote> allQuotes = new ArrayList<>();

// ...

Quote quote = new Quote(quoteText, quoteAuthor, tags);
allQuotes.add(quote);

Finally, to output the quote data to CSV:


try (PrintWriter pw = new PrintWriter("quotes.csv")) {
    pw.println("Quote|Author|Tags");

    for (Quote quote : allQuotes) {
        String tags = String.join(",", quote.tags);
        pw.println(quote.text + "|" + quote.author + "|" + tags);
    }
}

This writes each Quote's data to the file "quotes.csv", using a pipe "|" as the delimiter between fields (since the quote text itself may contain commas) and commas to separate the individual tags.
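To see how all the pieces fit together, here's the complete scraper assembled into one runnable class. It's the same code we developed above, presented as a sketch you can adapt:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class QuotesScraper {

    private static class Quote {
        String text;
        String author;
        List<String> tags;

        Quote(String text, String author, List<String> tags) {
            this.text = text;
            this.author = author;
            this.tags = tags;
        }
    }

    public static void main(String[] args) throws IOException {
        List<Quote> allQuotes = new ArrayList<>();
        Document doc = Jsoup.connect("https://quotes.toscrape.com/").get();

        while (true) {
            // extract the text, author, and tags of every quote on this page
            for (Element quote : doc.select(".quote")) {
                String text = quote.select(".text").text();
                String author = quote.select(".author").text();

                List<String> tags = new ArrayList<>();
                for (Element tag : quote.select(".tag")) {
                    tags.add(tag.text());
                }
                allQuotes.add(new Quote(text, author, tags));
            }

            // follow the "Next" link, or stop on the last page
            Element next = doc.select(".next a").first();
            if (next == null) {
                break;
            }
            doc = Jsoup.connect("https://quotes.toscrape.com" + next.attr("href")).get();
        }

        // write the results out as pipe-delimited CSV
        try (PrintWriter pw = new PrintWriter("quotes.csv")) {
            pw.println("Quote|Author|Tags");
            for (Quote q : allQuotes) {
                pw.println(q.text + "|" + q.author + "|" + String.join(",", q.tags));
            }
        }
    }
}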

And there we have it! A fully functioning web scraper built with Jsoup. Some additional things to consider:

Avoiding Getting Blocked When Web Scraping

Some websites don't like being scraped and will attempt to block requests they identify as coming from a scraper. Some techniques they use:

  • User-Agent detection: Blocking requests with blank, missing, or known bot User-Agents
  • Rate limiting: Blocking IPs that send too many requests too quickly
  • Honeypot traps: Hiding links that only bots can find in order to catch and block them

To avoid these, try to make your scraper behave more like a normal user:

  • Set a common browser User-Agent like we did above
  • Introduce random delays between requests to simulate reading time (see the sketch after this list)
  • Rotate proxy IPs and User-Agents when scraping a large number of pages from a single site
  • Respect robots.txt if present
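For instance, a random pause between page fetches takes only a couple of lines (the 1-3 second range here is an arbitrary choice):

// inside the scraping loop, before fetching the next page
try {
    // java.util.concurrent.ThreadLocalRandom; sleeps 1000-2999 ms
    Thread.sleep(ThreadLocalRandom.current().nextLong(1000, 3000));
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
}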

Alternatives to DIY Web Scraping

Depending on your specific needs, writing and running your own web scraper may not be necessary:

  • Many sites offer APIs providing their data in structured formats
  • Pre-made datasets are increasingly available covering common scraping needs
  • Third-party scraping services and tools handle the scraping and data cleaning for you

If you're looking for a large amount of web data without the hassle of scraping it yourself, it's worth checking if a pre-made dataset or API already exists. Bright Data, for example, offers datasets on many topics as well as customized data collection services.

Conclusion

Web scraping is an incredibly useful skill, allowing you to collect data otherwise locked within websites. With the Jsoup library, Java developers can scrape websites with just a few lines of code.

In this guide, we covered all the steps to building a complete web scraper in Java:

  • Setting up Jsoup in your project
  • Fetching web pages and parsing the HTML
  • Navigating the HTML document using CSS selectors to find the desired elements
  • Extracting text, attributes, and HTML from elements
  • Detecting and following pagination links to scrape all pages
  • Saving the scraped data into a structured format like CSV
  • Tips for avoiding anti-scraping countermeasures
  • Alternatives for cases where pre-made datasets or scraping services are preferable

You should now have a solid foundation for scraping websites using Java and Jsoup. Remember to always be respectful when scraping and abide by the rules in robots.txt. Happy scraping!

If you want to learn more, check out our other articles on web scraping, Java, and working with data. And feel free to leave any questions or thoughts in the comments below!
