Web Scraping With Scrapy: A Comprehensive Guide
Web scraping is the process of programmatically extracting data from websites. It allows you to collect and structure information from across the internet efficiently and at scale. Some common use cases for web scraping include market research, price monitoring, lead generation, competitor analysis, and more.
While you can write web scrapers from scratch using libraries like BeautifulSoup and requests, a more robust and extensible option is to use a dedicated web scraping framework like Scrapy. Scrapy is an open source Python framework that provides a complete ecosystem of tools and utilities for harvesting data from the web.
Some key benefits of using Scrapy for web scraping include:
- Built-in support for extracting data using CSS and XPath selectors
- Interactive shell console for trying out CSS and XPath expressions
- Ability to crawl and follow links to scrape entire domains
- Built-in extensions and middleware for handling cookies, sessions, authentication, and more
- Easy export of scraped data to JSON, CSV, and other formats
- Great performance thanks to asynchronous networking built on Twisted
In this in-depth tutorial, we'll walk through how to use Scrapy to scrape school-related information like homework assignments and lunch menus from the web. By the end, you'll have a fully functional web scraper that you can adapt for your own data needs. Let's get started!
Setting Up a Scrapy Project
Before we start writing any code, we need to create a new Scrapy project. Make sure you have Scrapy installed, then open a terminal and run:
scrapy startproject school_scraper
This will generate a new directory called school_scraper with the following structure:
school_scraper/
    scrapy.cfg            # deploy configuration file
    school_scraper/       # project's Python module; you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
The spiders directory is where we'll define our spider classes to scrape data. Speaking of which, let's create a spider to extract homework assignments.
Creating a Homework Spider
In Scrapy, spiders are classes that define how a certain site (or group of sites) will be scraped. They include instructions on which pages to scrape, which data to extract, and how to extract it.
To generate a new spider, navigate into the spiders directory and run:
scrapy genspider homework_spider systemcraftsman.github.io/scrapy-demo/website/index.html
This will create a new file called homework_spider.py with a basic spider template:
import scrapy


class HomeworkSpiderSpider(scrapy.Spider):
    name = 'homework_spider'
    allowed_domains = ['systemcraftsman.github.io']
    start_urls = ['http://systemcraftsman.github.io/scrapy-demo/website/index.html']

    def parse(self, response):
        pass
The parse method is called whenever the spider fetches a new page. It takes the page response as an argument and is responsible for extracting the desired data and/or following links to other pages.
In our case, the first thing our spider needs to do is log into the website. We can do this by submitting a form request in the parse method:
def parse(self, response):
    formdata = {'username': 'student', 'password': '12345'}
    return scrapy.FormRequest(
        url='https://systemcraftsman.github.io/scrapy-demo/website/welcome.html',
        method='GET',
        formdata=formdata,
        callback=self.after_login
    )
This code submits a GET request to the welcome page URL with the provided form data (dummy student credentials). The callback argument specifies the method that will be called to handle the response; in this case, after_login.
Let's define that after_login method now:
def after_login(self, response):
    if response.status == 200:
        return scrapy.Request(
            url=self.homework_page_url,
            callback=self.parse_homework_page
        )
This checks that the login was successful based on the status code, then follows the link to the homework assignments page. Again, it uses a callback to specify parse_homework_page as the handler for the next response.
Finally, we can extract the actual homework assignment data in the parse_homework_page method using XPath expressions:
def parse_homework_page(self, response):
    rows = response.xpath('//table[@class="table"]/tr')
    data = {}
    for row in rows[1:]:
        columns = row.xpath('td')
        date = columns[0].xpath('text()').get()
        if date == '12.03.2024':
            subject = columns[1].xpath('text()').get()
            assignment = columns[2].xpath('text()').get()
            data[subject] = assignment
    yield data
This locates the homework table on the page, skips the header row, and iterates through each data row. For every row matching the target date, it extracts the subject and assignment details using relative XPath expressions and stores them in a dictionary.
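The same skip-the-header, walk-the-rows pattern works independently of Scrapy. Here is a minimal sketch using Python's standard-library ElementTree on a tiny, well-formed table; the markup, dates, and assignments below are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table standing in for the scraped page
HTML = """
<table class="table">
  <tr><th>Date</th><th>Subject</th><th>Assignment</th></tr>
  <tr><td>12.03.2024</td><td>Math</td><td>Page 42, exercises 1-5</td></tr>
  <tr><td>13.03.2024</td><td>History</td><td>Read chapter 3</td></tr>
</table>
"""

table = ET.fromstring(HTML)
rows = table.findall("tr")

data = {}
for row in rows[1:]:  # skip the header row
    columns = row.findall("td")
    date = columns[0].text
    if date == "12.03.2024":  # keep only rows for the target date
        data[columns[1].text] = columns[2].text

print(data)  # {'Math': 'Page 42, exercises 1-5'}
```

Note that real pages are rarely well-formed XML, which is why Scrapy's selectors (backed by a lenient HTML parser) are used in practice.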
Once all rows are processed, the yield statement emits the scraped data for further handling. Scrapy will automatically pass this data through any defined item pipelines and exporters.
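Such a pipeline only needs a class with a process_item method, registered via ITEM_PIPELINES in settings.py. The class name and cleanup rule below are hypothetical:

```python
class CleanHomeworkPipeline:
    """Hypothetical item pipeline that trims whitespace from scraped values.

    Scrapy calls process_item() once for every item a spider yields.
    """

    def process_item(self, item, spider):
        return {key: value.strip() for key, value in item.items()}


# Quick check without a running spider (the spider argument is unused here)
pipeline = CleanHomeworkPipeline()
cleaned = pipeline.process_item({"Math": "  Page 42  "}, spider=None)
print(cleaned)  # {'Math': 'Page 42'}
```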
Here's the complete code for our HomeworkSpider:
import scrapy


class HomeworkSpider(scrapy.Spider):
    name = 'homework_spider'
    allowed_domains = ['systemcraftsman.github.io']
    start_urls = ['https://systemcraftsman.github.io/scrapy-demo/website/index.html']
    homework_page_url = 'https://systemcraftsman.github.io/scrapy-demo/website/homeworks.html'

    def parse(self, response):
        formdata = {'username': 'student', 'password': '12345'}
        return scrapy.FormRequest(
            url='https://systemcraftsman.github.io/scrapy-demo/website/welcome.html',
            method='GET',
            formdata=formdata,
            callback=self.after_login
        )

    def after_login(self, response):
        if response.status == 200:
            return scrapy.Request(
                url=self.homework_page_url,
                callback=self.parse_homework_page
            )

    def parse_homework_page(self, response):
        rows = response.xpath('//table[@class="table"]/tr')
        data = {}
        for row in rows[1:]:
            columns = row.xpath('td')
            date = columns[0].xpath('text()').get()
            if date == '12.03.2024':
                subject = columns[1].xpath('text()').get()
                assignment = columns[2].xpath('text()').get()
                data[subject] = assignment
        yield data
Now let's test it out! Run the spider using:
scrapy crawl homework_spider -o homework.json
If everything works, you should see the scraped homework assignments exported to a homework.json file in the project root.
Creating a Meal List Spider
We can use a very similar process to create a spider for extracting school lunch menu data. I'll spare you the details, but here's the full code:
import scrapy


class MealSpider(scrapy.Spider):
    name = 'meal_spider'
    allowed_domains = ['systemcraftsman.github.io']
    start_urls = ['https://systemcraftsman.github.io/scrapy-demo/website/index.html']
    meal_page_url = 'https://systemcraftsman.github.io/scrapy-demo/website/meal-list.html'

    def parse(self, response):
        formdata = {'username': 'student', 'password': '12345'}
        return scrapy.FormRequest(
            url='https://systemcraftsman.github.io/scrapy-demo/website/welcome.html',
            method='GET',
            formdata=formdata,
            callback=self.after_login
        )

    def after_login(self, response):
        if response.status == 200:
            return scrapy.Request(
                url=self.meal_page_url,
                callback=self.parse_meal_page
            )

    def parse_meal_page(self, response):
        date = '13.03.2024'  # target date
        col_index = 6  # 1-based td index of the column for the target date
        data = {
            'Breakfast': response.xpath(f'//tr[contains(., "BREAKFAST")]/td[{col_index}]/text()').getall(),
            'Lunch': response.xpath(f'//tr[contains(., "LUNCH")]/td[{col_index}]/text()').getall(),
            'Salad & Dessert': response.xpath(f'//tr[contains(., "SALAD/DESSERT")]/td[{col_index}]/text()').getall(),
            'Fruit Time': response.xpath(f'//tr[contains(., "FRUIT TIME")]/td[{col_index}]/text()').getall(),
        }
        yield data
The main difference is in the parse_meal_page method. Here we locate the different meal sections in the table by searching for specific text in each row, then extract the contents of the corresponding cell for the desired date.
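The contains() lookup can be mimicked in plain Python, which makes the logic easy to see. The rows and labels below are made-up stand-ins for the real menu table; note that XPath's td[n] index is 1-based, while Python lists are 0-based:

```python
# Made-up rows standing in for the meal table
rows = [
    ["BREAKFAST", "Eggs", "Pancakes", "Oatmeal"],
    ["LUNCH", "Soup", "Pasta", "Salad"],
]

def cell_for(label, col_index, rows):
    """Mimic //tr[contains(., label)]/td[col_index] (1-based, like XPath)."""
    for row in rows:
        # contains(., label) matches any row whose text contains the label
        if any(label in cell for cell in row):
            return row[col_index - 1]
    return None

print(cell_for("LUNCH", 3, rows))  # Pasta
```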
Run this spider with scrapy crawl meal_spider -o meals.json to export the scraped meal data.
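Instead of passing -o on every run, you can configure exports once with Scrapy's FEEDS setting. A sketch (the file name and options here are just examples):

```python
# settings.py (sketch): configure exports once instead of using -o each run
FEEDS = {
    "meals.json": {
        "format": "json",
        "overwrite": True,   # replace the file on each crawl
        "encoding": "utf8",
    },
}
```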
Advanced Tips for Web Scraping with Scrapy
While our example scraped a basic static website, real-world web scraping often involves additional challenges like dynamic content, authentication, IP blocking, and more. Here are some tips for handling these situations in Scrapy:
Dynamic Websites
Some websites render content dynamically using JavaScript. This means the data you want to scrape may not exist in the initial HTML response. There are a few ways to handle this:
- Use SplashRequest (from the scrapy-splash plugin) with a headless browser like Splash to execute JS and get the rendered DOM
- Reverse engineer the AJAX calls used to fetch dynamic data and replicate them in your spider
- Run a full browser like Puppeteer or Selenium to load pages, then pass the HTML to Scrapy for parsing
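For the second option, replicating an AJAX call often means the data arrives as JSON, so no HTML parsing is needed at all. A sketch with a stubbed response; the endpoint shape, field names, and payload are all hypothetical:

```python
import json
from types import SimpleNamespace

def parse_api_response(response):
    """Parse the JSON a dynamic page's backing API might return."""
    data = json.loads(response.text)
    for entry in data.get("assignments", []):
        yield {"subject": entry["subject"], "assignment": entry["task"]}

# Stubbed response object; a real spider would get this via scrapy.Request
stub = SimpleNamespace(text='{"assignments": [{"subject": "Math", "task": "p. 42"}]}')
results = list(parse_api_response(stub))
print(results)  # [{'subject': 'Math', 'assignment': 'p. 42'}]
```

Finding the real endpoint usually means watching the network tab in your browser's developer tools while the page loads.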
CAPTCHAs
CAPTCHAs are designed to prevent bots like scrapers from automatically submitting forms and accessing certain pages. Some solutions include:
- Using a CAPTCHA solving service that provides APIs to recognize CAPTCHAs in scraped pages
- Training your own ML model to solve the specific type of CAPTCHA used on the target site
- Detecting CAPTCHAs in middleware and presenting them to a human to solve manually
Managing Cookies and Sessions
Scrapy automatically handles cookies for you, persisting them between requests as needed. You can configure cookie handling with settings like COOKIES_ENABLED and COOKIES_DEBUG.
For more custom behavior, you can subclass Scrapy's default cookie middleware or write your own from scratch. This allows you to do things like:
- Save and reuse cookies across spider runs
- Set custom cookies (e.g. for authentication)
- Implement session handling logic
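A custom downloader middleware along these lines only needs a process_request method, enabled via DOWNLOADER_MIDDLEWARES in settings.py. The class name, cookie name, and value below are hypothetical:

```python
from types import SimpleNamespace

class AuthCookieMiddleware:
    """Hypothetical downloader middleware that injects a session cookie."""

    def process_request(self, request, spider):
        # Only set the cookie if the request doesn't already carry one
        request.cookies.setdefault("sessionid", "abc123")
        return None  # None means: continue processing the request normally


# Quick check with a stand-in request object (no Scrapy needed)
request = SimpleNamespace(cookies={})
AuthCookieMiddleware().process_request(request, spider=None)
print(request.cookies)  # {'sessionid': 'abc123'}
```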
Avoiding IP Bans
Scraping too aggressively or frequently from one IP can get you banned from a website. To avoid this:
- Slow down your crawl speed with the DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED settings
- Distribute requests over a pool of rotating proxy IP addresses
- Use a headless browser to better simulate human behavior
- Set a custom User-Agent header to avoid looking like a scraper
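The first and last points translate directly into a few lines of settings.py. The values below are illustrative starting points, not recommendations for any particular site:

```python
# settings.py (sketch): politeness settings to reduce the risk of bans
DOWNLOAD_DELAY = 2            # wait at least 2 seconds between requests
AUTOTHROTTLE_ENABLED = True   # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 5  # initial delay before throttling adjusts
USER_AGENT = "Mozilla/5.0 (compatible; my-research-bot)"  # hypothetical UA
```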
Bright Data Integration
For even more power and flexibility, you can integrate Scrapy with Bright Data, a leading web data platform. Bright Data's tools are compatible with Scrapy and can help you:
- Access a huge pool of proxy IPs from every country and city in the world
- Automatically retry failed requests and rotate IPs to avoid bans
- Collect data even from sites that are difficult to crawl due to anti-bot measures
- Render JS heavy pages and solve CAPTCHAs automatically
Check out the Bright Data website to learn more and start a free trial.
Conclusion
Web scraping with Scrapy is a powerful way to extract data from websites at scale. In this guide, we covered how to:
- Create a new Scrapy project
- Define spider classes to crawl and parse web pages
- Extract data from HTML using CSS and XPath selectors
- Format, clean, and export scraped data
- Handle authentication, dynamic content, IP bans, and more
We walked through a practical example of scraping school-related data like homework assignments and lunch menus. But the same techniques can be adapted for a wide variety of use cases.
With a solid foundation in Scrapy and some practice, you'll be able to efficiently collect web data for all kinds of applications like market research, news aggregation, real estate listings, and more. Thanks for reading!