
Web Scraping Using Python: Step-by-Step Guide


Web scraping is the idea of extracting information from a website and using it for a particular use case.

Let’s say you are trying to extract a table from a webpage, convert it to a JSON file, and use that JSON file to build some internal tools. With web scraping, you can extract the data you want by targeting specific elements in a webpage. Web scraping with Python is a very popular choice because Python provides libraries like BeautifulSoup and Scrapy to extract data effectively.


Extracting data efficiently is also an important skill for a developer or data scientist. This article will help you understand how to scrape a website effectively and get the content you need so you can manipulate it as required. For this tutorial, we’ll be using the BeautifulSoup package, a popular Python package for scraping data.

Why use Python for Web Scraping?

Python is the first choice for many developers when building web scrapers. There are many reasons why Python is the first choice, but for this article, let’s discuss three top reasons why Python is used for data scraping.

Library and Community Support: There are several great libraries, like BeautifulSoup, Scrapy, and Selenium, that provide powerful functions for scraping web pages effectively. Python has built an excellent ecosystem for web scraping, and because so many developers worldwide already use Python, you can quickly get help when you’re stuck.

Automation: Python is famous for its automation capabilities, and a complex tool that relies on scraping usually needs more than scraping alone. For example, if you want to build a tool that tracks the price of items in an online store, you’ll need to add some automation so that it can check the prices daily and add them to your database. Python lets you automate such processes with ease.
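As a rough sketch of that idea, a daily tracker could look like the following. The `fetch_price` function here is a hypothetical stand-in for the real scraping code, and a plain list stands in for the database:

```python
import time

def fetch_price():
    # Hypothetical stand-in: a real tracker would scrape the store
    # page here with requests + BeautifulSoup.
    return 1205.66

def record_price(history):
    # Append the current price with a timestamp to the "database"
    # (a plain list here; a real tool might write to SQLite instead).
    history.append({'time': time.time(), 'price': fetch_price()})

history = []
record_price(history)

# A real daily tracker would wrap the call in a loop:
# while True:
#     record_price(history)
#     time.sleep(24 * 60 * 60)  # wait one day between checks
```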

Data Visualization: Web scraping is heavily used by data scientists, who often need to extract data from web pages. With libraries like Pandas, Python makes it simple to turn raw scraped data into visualizations.
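For instance, once scraped rows are in hand, Pandas can clean and chart them in a few lines. This sketch assumes pandas is installed, and the hypothetical rows mimic the shape of the table we scrape later in this article:

```python
import pandas as pd

# Hypothetical rows shaped like scraped table data: a list of dicts.
rows = [
    {'Date': '2022-11-27', 'Open': '$1,205.66'},
    {'Date': '2022-11-26', 'Open': '$1,198.98'},
]

df = pd.DataFrame(rows)

# Clean the price strings into numbers so they can be plotted.
df['Open'] = df['Open'].str.replace('[$,]', '', regex=True).astype(float)

# df.plot(x='Date', y='Open')  # one line to chart the cleaned data
```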

Libraries for Web Scraping in Python

There are several libraries available in Python for making web scraping simpler. Let’s discuss the three most popular libraries here.

#1. BeautifulSoup

One of the most popular libraries for web scraping, BeautifulSoup has been helping developers scrape web pages since 2004. It provides simple methods to navigate, search, and modify the parse tree, and it handles the encoding of incoming and outgoing documents for you. It is well-maintained and has a great community.

#2. Scrapy

Scrapy is another popular framework for data extraction, with more than 43,000 stars on GitHub. It can also be used to scrape data from APIs, and it ships with some handy built-in features, such as sending emails.

#3. Selenium

Selenium is not primarily a web scraping library; it is a browser automation package whose functionality is easily extended to scraping webpages. It uses the WebDriver protocol to control different browsers and has been around for almost 20 years. With Selenium, you can easily automate browsers and scrape data from webpages.

Challenges with Python Web Scraping

You can face many challenges when trying to scrape data from websites: slow networks, anti-scraping tools, IP-based blocking, captchas, and so on. These issues can cause massive problems for a scraper.

But you can effectively bypass these challenges in a few ways. For example, in most cases a website blocks an IP address when more than a certain number of requests are sent within a specific time interval. To avoid IP blocking, you’ll need to code your scraper so that it pauses between requests.
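One minimal way to add such a cool-down is to sleep between requests. In the sketch below, the fetch function is injectable so the logic can be tried without touching the network; `requests.get` is the natural default in a real scraper:

```python
import time

def polite_get(urls, delay=2.0, fetch=None):
    # Fetch each URL, pausing `delay` seconds between requests to
    # stay under a site's rate limits.
    if fetch is None:
        import requests
        fetch = requests.get
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # cool-down between requests
        results.append(fetch(url))
    return results

# Dry run with a stub fetcher and a tiny delay:
start = time.time()
pages = polite_get(['a', 'b', 'c'], delay=0.01, fetch=lambda u: u.upper())
elapsed = time.time() - start
```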


Developers also tend to set honeypot traps for scrapers. These traps are usually invisible to human visitors but can still be crawled by a scraper. If you are scraping a website that sets such honeypot traps, you’ll need to code your scraper accordingly.
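One common honeypot pattern is a link hidden with CSS. A scraper can filter such links out before following them. The snippet below is a minimal sketch using made-up HTML; real sites may hide traps in other ways too, such as off-screen positioning:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second link is a hidden honeypot.
html = '''
<a href="/page1">Products</a>
<a href="/trap" style="display:none">secret</a>
<a href="/page2">About</a>
'''

soup = BeautifulSoup(html, 'html.parser')

def is_honeypot(tag):
    # Flag links hidden via inline CSS or the `hidden` attribute.
    style = tag.get('style', '').replace(' ', '')
    return 'display:none' in style or tag.has_attr('hidden')

safe_links = [a['href'] for a in soup.find_all('a') if not is_honeypot(a)]
```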

Captchas are another severe issue for scrapers. Most websites nowadays use a captcha to block bot access to their pages. In such cases, you might need to use a captcha solver.

Scraping a Website with Python

As discussed, we’ll be using BeautifulSoup to scrape a website. In this tutorial, we will scrape the historical data of Ethereum from Coingecko and save the table data as a JSON file. Let’s move on to building the scraper.

The first step is to install BeautifulSoup and Requests. For this tutorial, I’ll be using Pipenv. Pipenv is a virtual environment manager for Python. You can also use Venv if you want, but I prefer Pipenv. Discussing Pipenv is beyond the scope of this tutorial. But if you want to learn how Pipenv can be used, follow this guide. Or, if you want to understand Python virtual environments, follow this guide.

Launch the Pipenv shell in your project directory by running the command pipenv shell. It will launch a subshell in your virtual environment. Now, to install BeautifulSoup, run the following command:

pipenv install beautifulsoup4

Similarly, to install Requests, run:

pipenv install requests

Once the installation is complete, import the necessary packages into the main file. Create a file called main.py and import the packages like the below:

from bs4 import BeautifulSoup
import requests
import json

The next step is to get the historical data page’s contents and parse them using the HTML parser available in BeautifulSoup.

r = requests.get('https://www.coingecko.com/en/coins/ethereum/historical_data#panel')

soup = BeautifulSoup(r.content, 'html.parser')

In the above code, the page is fetched using the get method from the requests library. The response content is then parsed with BeautifulSoup’s HTML parser and stored in a variable called soup.
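It’s also worth confirming the request actually succeeded before parsing; requests raises an HTTPError from raise_for_status on 4xx/5xx responses. The snippet below demonstrates this on a hand-built Response object so it runs without the network:

```python
import requests

# Hand-built response standing in for a failed request.
resp = requests.Response()
resp.status_code = 404

try:
    resp.raise_for_status()  # raises for 4xx/5xx status codes
    ok = True
except requests.HTTPError:
    ok = False

# In the scraper, calling r.raise_for_status() right after
# requests.get makes a blocked or missing page fail loudly
# instead of silently parsing an error page.
```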

The original scraping part starts now. First, you’ll need to identify the table correctly in the DOM. If you open this page and inspect it using the developer tools available in the browser, you’ll see that the table has these classes table table-striped text-sm text-lg-normal.

Coingecko Ethereum Historical Data Table

To correctly target this table, you can use the find method.

table = soup.find('table', attrs={'class': 'table table-striped text-sm text-lg-normal'})

table_data = table.find_all('tr')

table_headings = []

for th in table_data[0].find_all('th'):
    table_headings.append(th.text)

In the above code, the table is first found using the soup.find method; then, using the find_all method, all tr elements inside the table are collected. These tr elements are stored in a variable called table_data. The table has a few th elements for the titles, so a new variable called table_headings is initialized to keep the titles in a list.

A for loop is then run over the first row of the table. In this row, all th elements are found, and their text values are added to the table_headings list. The text is extracted using the text attribute. If you print the table_headings variable now, you’ll see the following output:

['Date', 'Market Cap', 'Volume', 'Open', 'Close']

The next step is to scrape the rest of the rows, generate a dictionary for each row, and append the dictionaries to a list.

table_details = []

for tr in table_data:
    th = tr.find_all('th')
    td = tr.find_all('td')

    data = {}

    for i in range(len(td)):
        data.update({table_headings[0]: th[0].text})
        data.update({table_headings[i+1]: td[i].text.replace('\n', '')})

    if len(data) > 0:
        table_details.append(data)

This is the essential part of the code. For each tr in the table_data variable, the th elements are found first; in this table, the th element in each row holds the date. These th elements are stored in a variable called th, and similarly, all the td elements are stored in the td variable.

An empty dictionary, data, is initialized. After the initialization, we loop through the range of td elements. For each row, we first update the first field of the dictionary with the first th item: the code table_headings[0]: th[0].text assigns a key-value pair of the date heading and the first th element’s text.

After assigning the first element, the other elements are assigned using data.update({table_headings[i+1]: td[i].text.replace('\n', '')}). Here, each td element’s text is extracted using the text attribute, and all \n characters are removed using the replace method. The value is assigned to the i+1th element of the table_headings list because the ith element is already assigned.

Then, if the data dictionary’s length exceeds zero, we append the dictionary to the table_details list. You can print the table_details list to check, but we’ll be writing the values to a JSON file. Let’s take a look at the code for this:

with open('table.json', 'w') as f:
    json.dump(table_details, f, indent=2)
    print('Data saved to json file...')

We are using the json.dump method here to write the values into a JSON file called table.json. Once the writing is complete, we print Data saved to json file... into the console.
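To double-check the output, the file can be loaded straight back with json.load; a round trip should return the same list. The snippet below uses a one-row stand-in for table_details:

```python
import json

# One-row stand-in for the scraped table_details list.
table_details = [{'Date': '2022-11-27', 'Open': '$1,205.66'}]

with open('table.json', 'w') as f:
    json.dump(table_details, f, indent=2)

# Load it back to confirm the round trip preserved the data.
with open('table.json') as f:
    loaded = json.load(f)
```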

Now, run the file using the following command,

python main.py

After a moment, you’ll see the text Data saved to json file… in the console. You’ll also find a new file called table.json in the working directory. The file will look similar to the following:

[
  {
    "Date": "2022-11-27",
    "Market Cap": "$145,222,050,633",
    "Volume": "$5,271,100,860",
    "Open": "$1,205.66",
    "Close": "N/A"
  },
  {
    "Date": "2022-11-26",
    "Market Cap": "$144,810,246,845",
    "Volume": "$5,823,202,533",
    "Open": "$1,198.98",
    "Close": "$1,205.66"
  },
  {
    "Date": "2022-11-25",
    "Market Cap": "$145,091,739,838",
    "Volume": "$6,955,523,718",
    "Open": "$1,204.21",
    "Close": "$1,198.98"
  },
// ...
// ... 
]

You have successfully implemented a web scraper using Python. To view the complete code, you can visit this GitHub repo.

Conclusion

This article discussed how to implement a simple web scraper in Python. We saw how BeautifulSoup can be used to scrape data quickly from a website. We also discussed other available libraries and why Python is the first choice of many developers for scraping websites.

You may also look at these web scraping frameworks.
