A detailed guide to web scraping using ChatGPT Code Interpreter and its plugins.
Unless you’re building something entirely new, chances are you need some background information before you begin. Or, you might want to study the competition for valuable input. Beyond that, there can be countless reasons to be interested in a specific website’s content.
Web scraping is the process that serves such use cases.
And there are a few ways to go about that. There are heavy-weight tools you can subscribe to for professional scraping of big websites. Alternatively, you may require a specific setup for on-premise processing. Either way, the approach is expensive, time-consuming, and tedious for beginners, especially for scraping a few web pages.
Overview of ChatGPT for Web Scraping
I don’t need to introduce ChatGPT to you, do I?
In short, ChatGPT is a generative AI that gives human-like responses. You get a chat interface for asking it to complete various tasks, such as answering questions about historical events, writing essays, summarizing, translating, coding, etc.
ChatGPT replies in text. However, there are ChatGPT plugins that enhance its capabilities in many ways, and we’ll be using one such plugin. In addition, we’ll use its Code Interpreter to scrape websites that have complicated page structures or active anti-scraping measures.
Please know that ChatGPT has free and paid versions. But you’ll need the paid subscription (currently, $20 a month) for using the web scraper plugin or its Code Interpreter engine.
In further sections, I’ll illustrate the process step-by-step.
Disclaimer: Before proceeding, please confirm that the subject website allows its content to be scraped. If it doesn’t, you can contact the site admin and ask for permission to avoid any legal trouble.
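One quick, partial check is the site’s robots.txt file. Here is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt content and the example.com URLs are made up for illustration, and the helper name is_allowed is my own. Keep in mind that robots.txt is only one signal; a site’s terms of service can still prohibit scraping.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))
```

For a real site, you would pass in the text of its own robots.txt (usually found at the site root) instead of the sample above.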
Web Scraping Using ChatGPT Plugin
Log in to your OpenAI account, hover over GPT-4 (its current paid version), and click Plugins.
Next, click No plugins enabled, scroll down, and click Plugin Store.
Please note that instead of No plugins enabled, you’ll have a plugin icon if one is active. In that case, you need to click that icon to open the drop-down and click the Plugin store at the bottom.
This will open the Plugin store. Search for Scraper and hit Install.
Select this plugin in the ChatGPT interface.
Once it is selected, prompt ChatGPT with the subject URL and the content you want scraped.
I have done this for a few websites. Check this out.
Scraping a Publication
We are a tech-focussed publication, and I have chosen our home page, geekflare.com/ for this illustration.
Here’s the prompt:
check this webpage: https://geekflare.com/ and prepare a table indicating the article title, author, publication date, and excerpt for the top 10 articles.
You can also re-prompt ChatGPT to convert the data into CSV format, paste the output into a text file with a .csv extension, and open it in a spreadsheet application like MS Excel.
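If you would rather build the CSV yourself from the table values, the sketch below shows one way to do it with Python’s standard csv module. The row is a made-up placeholder (not real scraped output), and the function name rows_to_csv is hypothetical. The main thing it illustrates is that fields containing commas or quotes, which are common in article excerpts, get quoted properly.

```python
import csv
import io

FIELDS = ["title", "author", "date", "excerpt"]

# Placeholder row, for illustration only.
SAMPLE_ROWS = [
    {
        "title": "Example article, part 1",
        "author": "Jane Doe",
        "date": "2023-01-01",
        "excerpt": 'An excerpt with a "quoted" word',
    },
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, quoting fields where needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(SAMPLE_ROWS, FIELDS))
```

Saving the returned text to a file ending in .csv gives you something a spreadsheet application can open directly.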
Scraping a Deal or Coupon Webpage
The Geekflare deals section is where we have handpicked some offers on top-tech projects. How about fetching every deal in a tabular format?
Prepare a list of deals from this webpage: https://geekflare.com/deals/. present the result in a tabular format.
Scraping Wikipedia
Summarize in tabular format the latest news from the "in the news" section from this wikipedia page: https://en.wikipedia.org/wiki/Main_Page
Scraping E-commerce Stores
Lastly, I tried scraping Amazon.com for the laptops by applying a few filters and feeding the URL to ChatGPT. This is what I got:
The problem is that this isn’t an isolated case. You’ll find many websites that have anti-scraping measures in place. In such situations, you’ll need an alternative way of getting the data if subscribing to industry-standard scrapers isn’t an option.
The following section describes one such solution.
Web Scraping Using ChatGPT Code Interpreter
Code Interpreter is a newly launched ChatGPT engine that caters to programming-related tasks. While the default engine relies heavily on text responses, Code Interpreter can help visualize outputs, parse, debug, and execute code, integrate with software binaries, and do a lot more programming-centric things.
In this process, we will download the source HTML, upload it to ChatGPT Code Interpreter, and proceed with the scraping.
I have taken this page for extraction:
We will begin by saving the webpage as HTML. For that, go to the webpage and press Ctrl+S.
Now we have the file for scraping. Let’s figure out the prompt.
In addition to the text prompt, you can see I have given it sample elements to fast-track the scraping. Since Amazon’s web page structures are complex, without these samples, the scraping attempt might fail or result in nothing.
And getting these elements is fairly easy. Right-click anywhere on the subject webpage and click Inspect in the context menu.
First, click the topmost icon (marked as 1). This will highlight the details while you select elements from the page. Next, select the container element for any specific product.
Please make sure you select the innermost container. As you hover, the page keeps highlighting elements. The moment the highlight covers just that product block, click it and go over to the right side to copy the element’s div class.
Similarly, select the samples for other elements.
Finally, upload the HTML and prompt similar to this:
check out this webpage html and extract the laptop titles, price, and ratings. present the result in a tabular format within this chat interface and also give the results in a CSV to download.
div class="s-card-container s-overflow-hidden aok-relative puis-include-content-margin puis puis-vfcg1duwvmpo42mcln9ojhiljk s-latency-cf-section s-card-border"
sample title element: span class="a-size-medium a-color-base a-text-normal"
sample price element: span class="a-price-whole"
sample ratings element: span class="a-size-base puis-bold-weight-text"
This will take some time while ChatGPT Code Interpreter does its work. The chat will show only a few of the entries, whereas the complete results will be in the attached CSV file.
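To give a rough idea of what this kind of extraction involves, here is a minimal sketch using only Python’s standard html.parser module. This is not the code ChatGPT generated (which may well use a library such as BeautifulSoup); it is just an illustration. The class names come from the sample elements in the prompt above, the sample HTML is a made-up fragment, and the names ProductParser and extract_products are hypothetical. It assumes every product block starts with a div carrying the s-card-container class.

```python
from html.parser import HTMLParser

# Class names taken from the sample elements in the prompt.
CONTAINER_CLASS = "s-card-container"
FIELD_CLASSES = {
    "a-size-medium a-color-base a-text-normal": "title",
    "a-price-whole": "price",
    "a-size-base puis-bold-weight-text": "rating",
}


class ProductParser(HTMLParser):
    """Collect title, price, and rating for each product container."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if tag == "div" and CONTAINER_CLASS in classes.split():
            # A new container starts a new product record.
            self.products.append({"title": "", "price": "", "rating": ""})
        elif tag == "span" and classes in FIELD_CLASSES and self.products:
            self._field = FIELD_CLASSES[classes]

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field and not self.products[-1][self._field]:
            # Keep the first non-empty text found for the field.
            self.products[-1][self._field] = data.strip()


def extract_products(html):
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


# Made-up HTML fragment, for illustration only.
SAMPLE_HTML = """
<div class="s-card-container s-overflow-hidden">
  <span class="a-size-medium a-color-base a-text-normal">Example Laptop 15</span>
  <span class="a-size-base puis-bold-weight-text">4.5</span>
  <span class="a-price-whole">499</span>
</div>
<div class="s-card-container s-overflow-hidden">
  <span class="a-size-medium a-color-base a-text-normal">Example Laptop 17</span>
  <span class="a-price-whole">799</span>
</div>
"""

if __name__ == "__main__":
    for product in extract_products(SAMPLE_HTML):
        print(product)
```

Matching on exact class strings is fragile, since such class names can change whenever the site updates its markup. That is one reason why supplying fresh sample elements in the prompt matters.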
You can observe that the table has a few entries not present on the original web page, especially at the start. In such cases, you need to double-check and clean the data for any redundancies.
If there are any, you can re-prompt ChatGPT to get a clean CSV.
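Alternatively, simple clean-ups can be done locally. The sketch below, which assumes a CSV with title, price, and rating columns, drops rows that are missing a title or price and removes exact duplicates. The sample data is invented and the function name clean_rows is hypothetical. It cannot tell whether an entry really appears on the original page, so a manual check is still needed.

```python
import csv
import io


def clean_rows(csv_text, required=("title", "price")):
    """Drop rows missing a required field and remove exact duplicates."""
    seen = set()
    cleaned = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        # Normalize whitespace; missing cells come through as None.
        row = {k: (v or "").strip() for k, v in raw.items() if k is not None}
        if any(not row.get(field) for field in required):
            continue  # incomplete entry
        key = tuple(row.items())
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Invented sample data, for illustration only.
SAMPLE_CSV = """\
title,price,rating
Example Laptop 15,499,4.5
,,
Example Laptop 15,499,4.5
Example Laptop 17,799,
"""

if __name__ == "__main__":
    for row in clean_rows(SAMPLE_CSV):
        print(row)
```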
Final Thoughts
ChatGPT does many things, and basic web scraping is one of them. Agreed, it might not be suitable for someone scraping hundreds of pages. Still, it’ll get you started in the right direction and is ideal for a short scraping session.
In this guide, we have used one of its scraping plugins and Code Interpreter. While plugins work on many standard websites, the second method is for pages with custom structures or dynamic elements (endless scroll, read more, etc.).
And to reiterate, go through the subject website’s terms before scraping.
PS: Check out these cloud scraping solutions and our own Geekflare scraping API.