Web Scraping With Scrapy: A Comprehensive Guide

Web scraping is the process of programmatically extracting data from websites. It allows you to collect and structure information from across the internet efficiently and at scale. Some common use cases for web scraping include market research, price monitoring, lead generation, competitor analysis, and more.

While you can write web scrapers from scratch using libraries like BeautifulSoup and requests, a more robust and extensible option is to use a dedicated web scraping framework like Scrapy. Scrapy is an open source Python framework that provides a complete ecosystem of tools and utilities for harvesting data from the web.

Some key benefits of using Scrapy for web scraping include:

  • Built-in support for extracting data using CSS and XPath selectors
  • Interactive shell console for trying out CSS and XPath expressions (see the short example after this list)
  • Ability to crawl and follow links to scrape entire domains
  • Built-in extensions and middleware for handling cookies, sessions, authentication, and more
  • Easy export of scraped data to JSON, CSV, and other formats
  • Strong performance thanks to asynchronous networking built on Twisted
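
For example, the interactive shell lets you experiment with selectors against a live page before committing anything to a spider. A quick session against the demo site used later in this tutorial might look like this (the selectors are illustrative; what you try will depend on the page's markup):

scrapy shell "https://systemcraftsman.github.io/scrapy-demo/website/index.html"
>>> response.xpath('//title/text()').get()
>>> response.css('a::attr(href)').getall()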

In this in-depth tutorial, we'll walk through how to use Scrapy to scrape school-related information like homework assignments and lunch menus from the web. By the end, you'll have a fully functional web scraper that you can adapt for your own data needs. Let's get started!

Setting Up a Scrapy Project

Before we start writing any code, we need to create a new Scrapy project. Make sure you have Scrapy installed (pip install scrapy), then open a terminal and run:

scrapy startproject school_scraper

This will generate a new directory called school_scraper with the following structure:

school_scraper/
    scrapy.cfg            # deploy configuration file

    school_scraper/       # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

The spiders directory is where we'll define our spider classes to scrape data. Speaking of which, let's create a spider to extract homework assignments.

Creating a Homework Spider

In Scrapy, spiders are classes that define how a certain site (or group of sites) will be scraped. They include instructions on which pages to scrape, which data to extract, and how to extract it.

To generate a new spider, run the following command from within the project directory:

scrapy genspider homework_spider systemcraftsman.github.io/scrapy-demo/website/index.html  

This will create a new file called homework_spider.py with a basic spider template:

import scrapy

class HomeworkSpiderSpider(scrapy.Spider):
    name = 'homework_spider'
    allowed_domains = ['systemcraftsman.github.io']
    start_urls = ['http://systemcraftsman.github.io/scrapy-demo/website/index.html']

    def parse(self, response):
        pass

The parse method is called whenever the spider fetches a new page. It takes the page response as an argument and is responsible for extracting the desired data and/or following links to other pages.

In our case, the first thing our spider needs to do is log into the website. We can do this by submitting a form request in the parse method:

def parse(self, response):
    formdata = {'username': 'student', 'password': '12345'}
    return scrapy.FormRequest(
        url='https://systemcraftsman.github.io/scrapy-demo/website/welcome.html',
        method='GET',
        formdata=formdata, 
        callback=self.after_login
    )

This code submits a GET request to the welcome page URL with the provided form data (dummy student credentials) encoded as URL query parameters. The callback argument specifies the method that will be called to handle the response, in this case after_login.

Let's define that after_login method now:

def after_login(self, response):
    if response.status == 200:
        return scrapy.Request(
            url=self.homework_page_url,
            callback=self.parse_homework_page
        )

This checks that the login was successful based on the status code, then follows the link to the homework assignments page. Again, it uses a callback to specify parse_homework_page as the handler for the next response.
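
As written, after_login quietly does nothing when the status code is not 200. A small optional variant (not part of the original spider) that surfaces failed logins using Scrapy's built-in spider logger could look like this:

def after_login(self, response):
    if response.status != 200:
        # Log the failure so it is visible in the crawl output.
        self.logger.error('Login failed with status %s', response.status)
        return
    return scrapy.Request(
        url=self.homework_page_url,
        callback=self.parse_homework_page
    )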

Finally, we can extract the actual homework assignment data in the parse_homework_page method using XPath expressions:

def parse_homework_page(self, response):
    rows = response.xpath('//table[@class="table"]/tr')

    data = {}
    for row in rows[1:]:
        columns = row.xpath('td')

        date = columns[0].xpath('text()').get()
        if date == '12.03.2024':
            subject = columns[1].xpath('text()').get()
            assignment = columns[2].xpath('text()').get()

            data[subject] = assignment

    yield data

This navigates to the homework table on the page, skips the header row, and iterates through each data row. It extracts the date, subject, and assignment details using relative XPath expressions and stores them in a dictionary.

Once all rows are processed, the yield statement emits the scraped data for further handling. Scrapy will automatically pass this data through any defined item pipelines and exporters.
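
If you want to clean the data before it is exported, you can add an item pipeline. The pipeline below is a minimal illustrative sketch (the class name matches the one Scrapy generates in pipelines.py for this project, but the whitespace-stripping behavior is an assumption, not something the tutorial requires):

# pipelines.py
class SchoolScraperPipeline:
    def process_item(self, item, spider):
        # Strip stray whitespace from every scraped string value before export.
        return {
            key: value.strip() if isinstance(value, str) else value
            for key, value in item.items()
        }

To activate it, register the pipeline in settings.py:

ITEM_PIPELINES = {
    'school_scraper.pipelines.SchoolScraperPipeline': 300,
}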

Here's the complete code for our HomeworkSpider (we've renamed the generated HomeworkSpiderSpider class to the shorter HomeworkSpider):

import scrapy

class HomeworkSpider(scrapy.Spider):
    name = 'homework_spider'
    allowed_domains = ['systemcraftsman.github.io']
    start_urls = ['https://systemcraftsman.github.io/scrapy-demo/website/index.html']

    homework_page_url = 'https://systemcraftsman.github.io/scrapy-demo/website/homeworks.html'

    def parse(self, response):
        formdata = {'username': 'student', 'password': '12345'}
        return scrapy.FormRequest(
            url='https://systemcraftsman.github.io/scrapy-demo/website/welcome.html',
            method='GET',
            formdata=formdata,
            callback=self.after_login
        )

    def after_login(self, response):
        if response.status == 200:
            return scrapy.Request(
                url=self.homework_page_url,
                callback=self.parse_homework_page  
            )

    def parse_homework_page(self, response):
        rows = response.xpath('//table[@class="table"]/tr')

        data = {}
        for row in rows[1:]:
            columns = row.xpath('td')

            date = columns[0].xpath('text()').get()
            if date == '12.03.2024':
                subject = columns[1].xpath('text()').get()
                assignment = columns[2].xpath('text()').get()

                data[subject] = assignment

        yield data

Now let's test it out! Run the spider using:

scrapy crawl homework_spider -o homework.json 

If everything works, you should see the scraped homework assignments exported to a homework.json file in the project root.
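
As an alternative to passing -o on every run, you can configure the output once in settings.py using Scrapy's FEEDS setting (available since Scrapy 2.1); for example:

FEEDS = {
    'homework.json': {
        'format': 'json',
        'overwrite': True,
    },
}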

Creating a Meal List Spider

We can use a very similar process to create a spider for extracting school lunch menu data. I'll spare you the details, but here's the full code:

import scrapy

class MealSpider(scrapy.Spider):
    name = 'meal_spider'
    allowed_domains = ['systemcraftsman.github.io']
    start_urls = ['https://systemcraftsman.github.io/scrapy-demo/website/index.html']

    meal_page_url = 'https://systemcraftsman.github.io/scrapy-demo/website/meal-list.html'

    def parse(self, response):
        formdata = {'username': 'student', 'password': '12345'}
        return scrapy.FormRequest(
            url='https://systemcraftsman.github.io/scrapy-demo/website/welcome.html',
            method='GET',
            formdata=formdata,
            callback=self.after_login 
        )

    def after_login(self, response):
        if response.status == 200:
            return scrapy.Request(
                url=self.meal_page_url,
                callback=self.parse_meal_page
            )  

    def parse_meal_page(self, response):
        col_index = 6  # table column that corresponds to the target date (13.03.2024)

        data = {
            'Breakfast': response.xpath(f'//tr[contains(., "BREAKFAST")]/td[{col_index}]/text()').getall(),
            'Lunch': response.xpath(f'//tr[contains(., "LUNCH")]/td[{col_index}]/text()').getall(),
            'Salad & Dessert': response.xpath(f'//tr[contains(., "SALAD/DESSERT")]/td[{col_index}]/text()').getall(),
            'Fruit Time': response.xpath(f'//tr[contains(., "FRUIT TIME")]/td[{col_index}]/text()').getall(),
        }

        yield data

The main difference is in the parse_meal_page method. Here we locate the different meal sections in the table by searching for specific text in each row. We then extract the contents of the corresponding cell for the desired date.

Run this spider with scrapy crawl meal_spider -o meals.json to export the scraped meal data.

Advanced Tips for Web Scraping with Scrapy

While our example scraped a basic static website, real-world web scraping often involves additional challenges like dynamic content, authentication, IP blocking, and more. Here are some tips for handling these situations in Scrapy:

Dynamic Websites

Some websites render content dynamically using JavaScript. This means the data you want to scrape may not exist in the initial HTML response. There are a few ways to handle this:

  • Use SplashRequest from the scrapy-splash plugin with the Splash rendering service to execute JS and get the rendered DOM (see the sketch after this list)
  • Reverse engineer the AJAX calls used to fetch dynamic data and replicate them in your spider
  • Run a full browser like Puppeteer or Selenium to load pages, then pass the HTML to Scrapy for parsing
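
As a sketch of the first approach, a spider using the scrapy-splash package might look roughly like this. It assumes you have a Splash instance running and scrapy-splash installed and configured (SPLASH_URL plus its middlewares) as described in the scrapy-splash documentation; the URL and spider name are placeholders:

import scrapy
from scrapy_splash import SplashRequest

class JsSiteSpider(scrapy.Spider):
    name = 'js_site_spider'  # hypothetical spider, not part of this tutorial's project

    def start_requests(self):
        # Ask Splash to render the page and wait briefly for JavaScript to finish.
        yield SplashRequest(
            'https://example.com',
            callback=self.parse,
            args={'wait': 2},
        )

    def parse(self, response):
        # The response now contains the JavaScript-rendered DOM.
        yield {'title': response.xpath('//title/text()').get()}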

CAPTCHAs

CAPTCHAs are designed to prevent bots like scrapers from automatically submitting forms and accessing certain pages. Some solutions include:

  • Using a CAPTCHA solving service that provides APIs to recognize CAPTCHAs in scraped pages
  • Training your own ML model to solve the specific type of CAPTCHA used on the target site
  • Detecting CAPTCHAs in middleware and presenting them to a human to solve manually

Managing Cookies and Sessions

Scrapy automatically handles cookies for you, persisting them between requests as needed. You can configure cookie handling with settings like COOKIES_ENABLED and COOKIES_DEBUG.

For more custom behavior, you can subclass Scrapy's default cookie middleware or write your own from scratch. This allows you to do things like:

  • Save and reuse cookies across spider runs
  • Set custom cookies, e.g. for authentication (see the sketch after this list)
  • Implement session handling logic
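
As a minimal sketch of the custom-cookie case, you can attach cookies directly to a request and let Scrapy's cookie middleware carry them through the rest of the session (the cookie name and value below are placeholders):

def start_requests(self):
    # Attach a pre-existing session cookie to the first request;
    # Scrapy will keep it for subsequent requests to the same site.
    yield scrapy.Request(
        'https://example.com/account',
        cookies={'sessionid': 'your-session-id-here'},
        callback=self.parse,
    )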

Avoiding IP Bans

Scraping too aggressively or frequently from one IP can get you banned from a website. To avoid this:

  • Slow down your crawl speed with the DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED settings (see the example after this list)
  • Distribute requests over a pool of rotating proxy IP addresses
  • Use a headless browser to better simulate human behavior
  • Set a custom User-Agent header to avoid looking like a scraper
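
For example, a conservative settings.py snippet that combines several of these ideas might look like the following; the exact values are illustrative and should be tuned per site:

# settings.py
DOWNLOAD_DELAY = 2                   # wait at least 2 seconds between requests
AUTOTHROTTLE_ENABLED = True          # adapt the delay to how the server responds
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # limit parallel requests to a single domain
USER_AGENT = 'Mozilla/5.0 (compatible; school-scraper/1.0)'  # identify your crawler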

Bright Data Integration

For even more power and flexibility, you can integrate Scrapy with Bright Data, a leading web data platform. Bright Data's tools are compatible with Scrapy and can help you:

  • Access a large pool of proxy IPs spanning countries and cities around the world
  • Automatically retry failed requests and rotate IPs to avoid bans
  • Collect data even from sites that are difficult to crawl due to anti-bot measures
  • Render JS heavy pages and solve CAPTCHAs automatically

Check out the Bright Data website to learn more and start a free trial.

Conclusion

Web scraping with Scrapy is a powerful way to extract data from websites at scale. In this guide, we covered how to:

  • Create a new Scrapy project
  • Define spider classes to crawl and parse web pages
  • Extract data from HTML using CSS and XPath selectors
  • Format, clean, and export scraped data
  • Handle authentication, dynamic content, IP bans, and more

We walked through a practical example of scraping school-related data like homework assignments and lunch menus. But the same techniques can be adapted for a wide variety of use cases.

With a solid foundation in Scrapy and some practice, you'll be able to efficiently collect web data for all kinds of applications like market research, news aggregation, real estate listings, and more. Thanks for reading!
