How to Scrape Yelp: The Ultimate Guide to Extracting Yelp Data with Python

If you're looking to gather valuable business intelligence, there are few data sources as rich as Yelp. With millions of user-generated reviews, ratings, and business profiles, Yelp offers unparalleled insights into local markets and customer sentiment.

But manually browsing Yelp and copying data is extremely tedious and time-consuming. That's where web scraping comes in. Using Python and some basic tools, you can programmatically extract large amounts of Yelp data to analyze for your own purposes.

In this guide, I'll walk you through the entire process of scraping data from Yelp using Python. You'll learn what data is available, how to set up your scraping environment, and the code you need to automate data extraction. Whether you're a marketer, data scientist, or business owner, this knowledge will help you unlock valuable insights from Yelp's treasure trove of review data.

Why Scrape Data from Yelp?

Before we dive into the technical details, let's discuss why you might want to scrape data from Yelp in the first place. Here are a few key use cases:

Business Intelligence – Yelp data can tell you a lot about your competitors and your local market. You can track ratings over time, analyze customer reviews to identify strengths and weaknesses, benchmark your performance, and discover opportunities for differentiation.

Market Research – Curious about a new market? Yelp data can provide a broad yet granular view into the local business landscape, consumer trends, and unmet needs in any geography. You can assess market size, competitor density, and more.

Natural Language Processing – Yelp hosts a massive corpus of conversational text data in the form of user reviews. This data is useful for training language models, analyzing sentiment, extracting entities, and testing NLP algorithms across domains.

Investment Research – Reviews and ratings can be leading indicators of a business's health and growth prospects. Investors can scrape Yelp data to gain an edge in evaluating investment opportunities in local markets and sectors.

The possibilities are endless – I'm sure you can think of many more applications based on your own needs and interests. Now let's get into the nuts and bolts of actually scraping this data.

What Data Can You Scrape from Yelp?

A typical Yelp business profile page contains a wealth of valuable structured and unstructured data points:

  • Business name, address, phone number, website
  • Hours of operation
  • Price range
  • Overall star rating
  • Review count
  • Individual user reviews with text, rating, date
  • Business attributes and categories
  • Top user tags and keywords
  • Photos and videos

Using the techniques explained below, you can extract all of this information and more from any business on Yelp. However, be aware that scraping large amounts of data may violate Yelp's Terms of Service, so proceed with caution and make sure you comply with robots.txt restrictions.
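
You can check those robots.txt restrictions programmatically with Python's standard-library urllib.robotparser. Here's a minimal sketch (the search URL below is just an example path):

from urllib import robotparser

# Load and parse Yelp's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url('https://www.yelp.com/robots.txt')
rp.read()

# True if a generic crawler ('*') may fetch this example search URL
print(rp.can_fetch('*', 'https://www.yelp.com/search?find_desc=Restaurants'))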

Tools and Libraries for Scraping Yelp

To scrape data from Yelp, you'll need a basic working knowledge of Python and a few key libraries:

Requests – This library allows you to make HTTP requests from Python, which is essential for fetching the HTML source of web pages. You can install it with pip install requests.

BeautifulSoup – BeautifulSoup is a powerful library for parsing HTML and XML documents. It makes it easy to extract data from specific tags and attributes. Install it via pip install beautifulsoup4.

Pandas – For easier data manipulation and analysis, you'll want to use the Pandas library to store your scraped data in a structured format like a DataFrame. Get it with pip install pandas.

Make sure you have these three libraries installed in your Python environment before proceeding. You can verify your installation by running pip list.
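
As an alternative sanity check, you can import each library and print its version; if any import fails, the corresponding pip install didn't succeed:

import requests
import bs4
import pandas

# Each package exposes a __version__ attribute
print(requests.__version__, bs4.__version__, pandas.__version__)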

Step-by-Step Yelp Scraping Tutorial

Now we're ready to build our Yelp scraper step-by-step! Open up a Jupyter Notebook or your favorite Python IDE and follow along.

Step 1: Import libraries and set up constants

First, let's import the libraries we'll be using and define some constant variables:


import requests
import urllib.parse

from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = 'https://www.yelp.com'
SEARCH_PATH = '/search?find_desc=Restaurants&find_loc='
CITIES = ['New York, NY', 'Los Angeles, CA', 'Chicago, IL']

We set the base Yelp URL and search path separately to make it easy to change the business category and location we're scraping. In this example, we'll scrape restaurant data for the three largest U.S. cities.

Step 2: Fetch the HTML of search results

Next, we define a function that fetches the HTML source of a Yelp search results page, given a city name and an optional result offset (we'll use the offset for pagination in Step 4):


def get_search_html(city, start=0):
    # URL-encode the city name, e.g. "New York, NY" -> "New+York%2C+NY"
    loc = urllib.parse.quote_plus(city)
    url = BASE_URL + SEARCH_PATH + loc + f'&start={start}'
    resp = requests.get(url)
    return resp.text

This function URL-encodes the city name, combines it with the base URL and search path, and appends the start offset as a query parameter. It then sends a GET request and returns the HTML source of the page.
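
One caveat: Yelp may reject requests that carry the default python-requests User-Agent. A common workaround, sketched below with an illustrative browser User-Agent string and a hypothetical fetch helper, is to send browser-like headers and set a timeout so a stalled request fails fast:

import requests

# Illustrative browser-style User-Agent; any current browser string works
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'}

def fetch(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()  # raise on 4xx/5xx instead of silently parsing an error page
    return resp.text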

Step 3: Parse the HTML to extract business data

With the raw HTML in hand, we now need to parse it and extract the data we want into a structured format. Here's a function to do that using BeautifulSoup:


def parse_search_results(html):
    soup = BeautifulSoup(html, 'lxml')
    businesses = []

    for result in soup.find_all('div', class_='biz-listing-large'):
        name_link = result.find('a', class_='biz-name')
        name = name_link.text.strip()
        url = BASE_URL + name_link['href']
        category = [cat.text for cat in result.find_all('span', class_='category-str-list')]
        rating = float(result.find('div', class_='i-stars')['title'].split(' ')[0])
        review_count = int(result.find('span', class_='review-count').text.strip().split(' ')[0])
        address = result.find('address').text.strip()

        business = {
            'name': name,
            'category': category,
            'rating': rating,
            'review_count': review_count,
            'address': address,
            'url': url
        }
        businesses.append(business)

    return businesses

This function takes the HTML source as input and does the following:

  1. Parses the HTML using BeautifulSoup and the lxml parser
  2. Finds all the business result DIV elements
  3. For each business result:
    • Extracts the business name, URL, category, rating, review count, and address
    • Stores the extracted data in a dictionary
    • Appends the dictionary to a list
  4. Returns the list of business dictionaries

The secret sauce here is using BeautifulSoup's find and find_all methods to locate the desired elements in the parsed HTML tree, then extracting the relevant bits of data from those elements using dictionary lookups and list comprehensions. Note that Yelp updates its markup periodically, so if the parser comes back empty, inspect the live page's HTML and adjust the class names above.
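
To sanity-check the parser, you can run the two functions together on a single city and inspect the first record (this assumes Yelp's current markup still matches the class names above):

# Fetch one page of search results and parse it
html = get_search_html('New York, NY')
results = parse_search_results(html)

print(f'Parsed {len(results)} businesses')
if results:
    print(results[0])  # one dictionary with name, category, rating, etc.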

Step 4: Handle pagination

To get all the businesses for a given city, we need to keep fetching and parsing pages of search results until there are no more. Here's how we do that:


def scrape_city(city):
    businesses = []
    page = 1

    while True:
        print(f'Scraping page {page} for {city}...')
        html = get_search_html(city, start=10 * (page - 1))
        page_businesses = parse_search_results(html)

        if not page_businesses:
            break

        businesses.extend(page_businesses)
        page += 1

    return businesses

This function scrapes all pages of search results for a given city:

  1. Initializes an empty list to store the scraped businesses and sets the starting page number to 1
  2. Starts an infinite loop that:
    • Fetches the HTML for the current page by calling get_search_html with a start offset of 10 * (page - 1)
    • Parses the HTML into business records using parse_search_results
    • Checks if the current page returned any results, and if not, breaks the loop
    • Extends the master list of businesses with the results from the current page
    • Increments the page number
  3. Returns the list containing all scraped businesses for the city

To fetch the next page of search results, we pass a start offset that skips ahead by 10 results per page. We keep going until we hit an empty page, indicating we've reached the end of the results.
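
In line with the responsible-scraping tips later in this guide, you may also want to pause between page fetches and cap the total number of pages. Here's one way to fold both tweaks into the loop (the two-second delay and 50-page cap are arbitrary choices):

import time

MAX_PAGES = 50  # safety cap so a parsing failure can't loop forever

def scrape_city_politely(city, delay=2.0):
    businesses = []
    for page in range(1, MAX_PAGES + 1):
        html = get_search_html(city, start=10 * (page - 1))
        page_businesses = parse_search_results(html)
        if not page_businesses:
            break
        businesses.extend(page_businesses)
        time.sleep(delay)  # be polite: wait between requests
    return businesses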

Step 5: Scrape the cities and store the data

Finally, we tie it all together by looping through our list of cities, scraping the business data for each one, and storing the combined results in a Pandas DataFrame:


all_businesses = []

for city in CITIES:
    print(f'\nScraping {city}...\n')
    businesses = scrape_city(city)
    all_businesses.extend(businesses)

df = pd.DataFrame(all_businesses)
print(f'\nScraped {len(df)} total businesses:\n')
print(df.head())
df.to_csv('yelp_businesses.csv', index=False)

This script does the following:

  1. Initializes an empty list to store businesses from all cities
  2. Loops through the list of cities and for each one:
    • Calls the scrape_city function to get all businesses
    • Extends the master list with the results
  3. Creates a DataFrame from the master list of scraped business dictionaries
  4. Prints the number of businesses scraped and previews the first few rows
  5. Saves the DataFrame to a CSV file for future analysis

If all went well, you should see a DataFrame with the details of hundreds of restaurants from the cities you specified, and a CSV file saved in your current directory. Congratulations – you've just scraped Yelp!
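
With the data in a DataFrame, quick summaries take one line each. For example, using the columns built in Step 3:

# Overall averages across everything we scraped
print(df['rating'].mean())
print(df['review_count'].sum())

# The ten most-reviewed restaurants
print(df.sort_values('review_count', ascending=False)
        .head(10)[['name', 'rating', 'review_count']])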

Tips for Scraping Yelp Responsibly

Web scraping can be a powerful tool, but it's important to do it ethically and responsibly to avoid burdening servers or violating terms of service. Here are a few best practices to keep in mind:

  • Respect robots.txt: Always check which paths a site allows in its robots.txt file before scraping, and recheck periodically, since the rules can change over time.
  • Limit speed: Insert delays between requests to avoid hammering the server. A few seconds is usually sufficient.
  • Rotate user agents and IP addresses: Websites can block scrapers that make many requests with the same user agent or IP. Use a pool of user agents and proxies, or a service like Scraper API to handle this for you (a minimal sketch of user-agent rotation follows this list).
  • Avoid content behind login walls: Don't scrape any content that requires you to log in to a user account. This is not only unethical but may also violate the site's terms of service and, in some jurisdictions, the law.
  • Cache pages locally: Save copies of the pages you scrape so you can parse them repeatedly without refetching from the server.
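
For user-agent rotation specifically, a lightweight approach is to keep a small pool of header strings and pick one at random per request. The strings below are illustrative placeholders; substitute real, current browser User-Agents:

import random
import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',        # placeholder
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',  # placeholder
    'Mozilla/5.0 (X11; Linux x86_64) ...',                  # placeholder
]

def fetch_with_random_ua(url):
    # Each request goes out with a randomly chosen User-Agent
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10).text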

Going Further – What's Next?

In this guide we covered the basics of scraping Yelp, but there's a lot more you can do to enhance your scraper and analyze the resulting data.

Here are some ideas to extend your Yelp scraping project:

  • Upgrade your parser to handle dynamically loaded content that requires JavaScript rendering
  • Scale up your scraper by parallelizing requests across many worker machines or threads
  • Fine-tune your data model to extract even more granular data points like menu items, price points, or owner responses
  • Apply machine learning techniques like sentiment analysis to extracted review text to surface insights
  • Visualize trends and comparisons between scraped businesses using matplotlib or Plotly (see the sketch after this list)
  • Set up automated alerts to track ratings, reviews, and other metrics over time
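
As a small taste of the visualization idea above, here's a sketch that plots the distribution of star ratings from the CSV we saved earlier (it assumes matplotlib is installed):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('yelp_businesses.csv')

# Bar chart: how many businesses fall into each star-rating bucket
df['rating'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Star rating')
plt.ylabel('Number of businesses')
plt.title('Distribution of ratings across scraped cities')
plt.tight_layout()
plt.show()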

And if you find that your scraping needs outgrow your custom script, don't reinvent the wheel. Consider using a full-fledged web scraping tool like Parsehub or Import.io, or outsourcing your data collection to a professional service like Scrapinghub or ScraperAPI.

With so many possibilities, this tutorial is really just the beginning. I hope it has equipped you with the knowledge and tools you need to start extracting valuable insights from Yelp.

Happy scraping!
