Beautiful Soup vs. Scrapy for web scraping

Many people who are just starting out with web scraping wonder what the differences are between Beautiful Soup and Scrapy. This post should shed some light on their different features and purposes and help you decide which one is better for your project.

Web scraping: Beautiful Soup or Scrapy?

Python devs are even more likely than most to need web scraping at some point in their careers. In the age of Big Data, knowing how to craft bots to "open" websites and extract information, otherwise known as web scraping, is almost a requirement for anyone who deals with digital data.

Two popular tools for web scraping in Python are Beautiful Soup and Scrapy. Which one is right for your scraping needs? Let's find out.

This article is for you if you have Python installed on your computer, understand the basics of CSS selectors, and are comfortable using the browser DevTools to find and select page elements.

What is Beautiful Soup?

Beautiful Soup is a Python library that allows you to parse HTML and XML documents and extract data from them. It provides a simple and intuitive way to navigate and search the HTML tree structure, using tags, attributes, and text content as search criteria.

Start off by using pip to install Beautiful Soup and Python Requests as project dependencies:

pip install beautifulsoup4 requests

To scrape a web page, first download its HTML content with an HTTP client such as Requests, then parse that content with Beautiful Soup:

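import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'

# Download the page; fail fast on network stalls and HTTP error statuses
response = requests.get(url, timeout=10)
response.raise_for_status()
html_content = response.text

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html_content, 'html.parser')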

Then, you can use Beautiful Soup methods to extract the data you're interested in. For example, let's say we want to extract the website title and a list of all the URLs on the page:

title = soup.find('title').get_text()

# Collect the href of every link; skip <a> tags that have no href attribute
url_list = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href is not None:
        url_list.append(href)

print(title, url_list)

This code will print out the title and a URL list of all links on the page.
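
The tag, attribute, and text criteria mentioned earlier can also be tried without any network access. The sketch below parses a small, made-up HTML snippet (the markup, class names, and URLs are assumptions for illustration only) and searches it in each of these ways:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = """
<html>
  <head><title>Example Shop</title></head>
  <body>
    <h1 id="main-heading">Products</h1>
    <a class="product" href="/products/1">Red mug</a>
    <a class="product" href="/products/2">Blue mug</a>
    <a href="/about">About us</a>
    <a name="anchor-without-href">No link here</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search by tag name
heading = soup.find("h1").get_text()

# Search by attribute: only <a> tags with class="product"
product_urls = [a["href"] for a in soup.find_all("a", class_="product")]

# Search by text content: the <a> tag whose text is exactly "About us"
about_url = soup.find("a", string="About us")["href"]

# Search with a CSS selector: every <a> that actually has an href
all_urls = [a["href"] for a in soup.select("a[href]")]

print(heading, product_urls, about_url, all_urls)
```

Working from an inline string like this is a convenient way to experiment with selectors before pointing the code at a live page.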

What is Scrapy?

Scrapy is a Python framework for web scraping that provides a more powerful and customizable way to extract data from websites. It lets you define rules for navigating a website and extracting data from multiple pages, and it has built-in support for exporting the scraped data in formats such as JSON, CSV, and XML.

To use Scrapy, you first need to install it using pip:

# Install Scrapy
pip install scrapy

Then, you can create a new Scrapy project using the scrapy command:

# Create Scrapy project
scrapy startproject myproject

This will create a new directory called myproject with the basic structure of a Scrapy project. You can then generate a spider, which is the main component of Scrapy that does the actual scraping:

# Generate Spider
scrapy genspider myspider https://www.example.com

Here's an example of a simple spider that extracts the titles and URLs of all the links on a web page:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # Each <a> element on the page becomes one scraped item
        for link in response.css('a'):
            yield {
                'title': link.css('::text').get(),
                # .get() avoids a KeyError on anchors without an href
                'url': link.attrib.get('href'),
            }

This spider defines a parse method that is called for each page that it visits, starting from the URLs defined in start_urls. It uses Scrapy's built-in selectors to extract the title and URL of each link and yields a dictionary with this data.

To run the spider, you then use the scrapy crawl command:

# Run the spider
scrapy crawl myspider

Advanced Scrapy features

Queue of URLs to scrape

Scrapy can manage a queue of requests to scrape, with automatic deduplication and checking of maximum recursion depth. For example, this spider scrapes the titles of all linked pages up to a depth of 5:

import scrapy


class TitleSpider(scrapy.Spider):
    name = 'titlespider'
    start_urls = ['https://www.example.com']
    custom_settings = {
        "DEPTH_LIMIT": 5
    }

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # response.follow resolves relative links against the current page URL
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
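
To show what deduplication and a depth limit mean in practice, here is a conceptual sketch that uses only the standard library. It is not how Scrapy's scheduler is implemented; the SITE dictionary and the crawl function are hypothetical stand-ins for a real website and a real crawler:

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical site: each URL maps to the (possibly relative) links on that page
SITE = {
    "https://www.example.com/": ["/a", "/b", "/a"],
    "https://www.example.com/a": ["/b", "/c"],
    "https://www.example.com/b": ["/"],
    "https://www.example.com/c": ["/d"],
    "https://www.example.com/d": [],
}


def crawl(start_url, depth_limit):
    """Breadth-first crawl with deduplication and a maximum link depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= depth_limit:
            continue  # do not follow links from pages at the depth limit
        for href in SITE.get(url, []):
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:  # skip URLs that were already queued
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited


print(crawl("https://www.example.com/", depth_limit=2))
```

With a depth limit of 2, the page at /d is never visited because it is three links away from the start URL, and the repeated links to /a, /b, and / are queued only once.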

Multiple output formats

Scrapy directly supports saving the output to many different formats, like JSON, CSV, and XML:

# Run the spider and save output into a JSON file
scrapy crawl myspider -o myfile.json

# Run the spider and save output into a CSV file
scrapy crawl myspider -o myfile.csv

# Run the spider and save output into an XML file
scrapy crawl myspider -o myfile.xml
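
Exports can also be configured in the project settings instead of on the command line. This is a minimal sketch, assuming a Scrapy version that supports the FEEDS setting (2.1 or later); the output paths are placeholders:

```python
# settings.py
# Each key is an output path (placeholder names); "format" selects the exporter
FEEDS = {
    "output/items.json": {"format": "json"},
    "output/items.csv": {"format": "csv"},
    "output/items.xml": {"format": "xml"},
}
```

With a setting like this in place, a plain scrapy crawl myspider should write all three files in one run.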

Cookies

Scrapy receives and keeps track of cookies sent by servers and sends them back on subsequent requests as any regular web browser does.

If you want to specify additional cookies for your requests, you can add them to the Scrapy Request you're creating:

request_with_cookies = scrapy.Request(
    url="http://www.example.com",
    cookies={'currency': 'USD', 'country': 'UY'},
)
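
To make the round trip concrete, here is a rough standard-library sketch of what a cookie-aware client does: it remembers a cookie from a Set-Cookie header, merges in the extra cookies, and builds the Cookie header for the next request. It illustrates the idea only and is not how Scrapy's cookie handling is implemented; the session cookie value is made up:

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value received from a server
set_cookie_header = "sessionid=abc123; Path=/"

jar = SimpleCookie()
jar.load(set_cookie_header)  # remember the cookie the server sent

# Extra cookies supplied by the scraper, as in the Request above
extra_cookies = {"currency": "USD", "country": "UY"}
for name, value in extra_cookies.items():
    jar[name] = value

# Build the Cookie header sent with the next request
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)
```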

User agent spoofing

Scrapy supports setting the user agent of all requests to a custom value, which is useful, for example, if you want to scrape the mobile version of a website. Just put the user agent in the settings.py file in your project, and it will be automatically used for all requests:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.57 Mobile Safari/537.36'

When to use Beautiful Soup and when to use Scrapy

You should use Beautiful Soup when you just need to extract data from a few simple web pages, and you don't expect that they will try to block you from scraping them.

You should use Scrapy when you want to scrape a whole website, follow links from one page to another, have to deal with cookies and blocking, and export a lot of data in multiple formats.

The verdict

Okay, so you'll ultimately have to make the choice between Beautiful Soup and Scrapy yourself, but here's a quick summary of the differences for you to keep in mind. Beautiful Soup is generally easier to use and more flexible than Scrapy, making it a good choice for small-scale web scraping tasks. Scrapy is more powerful and customizable, making it a good choice for larger and more complex data extraction projects.