Last updated: June 2, 2023

Data extraction is the process of gathering specific data from web pages. Users can extract text, images, videos, reviews, products, etc. You can extract data to perform market research, sentiment analysis, competitive analysis, and data aggregation.

If you are dealing with a small amount of data, you can extract it manually by copy-pasting the specific information from web pages into a spreadsheet or document format of your liking. For instance, if, as a customer, you are looking for reviews online to help you make a purchase decision, you can scrape the data manually.

On the other hand, if you are dealing with large data sets, you need an automated data-extraction technique. You can create an in-house data extraction solution or use a proxy API or scraping API for such tasks.

However, these techniques may be less effective, as some of the sites you target might be protected by CAPTCHAs. You may also have to manage bots and proxies. Such tasks can take much of your time and limit the nature of the content you can extract.

Scraping Browser: The Solution


You can overcome all these challenges with the Scraping Browser by Bright Data. This all-in-one browser helps collect data from websites that are hard to scrape. It is a headful browser, meaning it has a graphical user interface (GUI), and it is controlled through the Puppeteer or Playwright API, which makes it hard for bot-detection software to tell it apart from a real user.

Scraping Browser has built-in unlocking features that automatically handle all the blocks on your behalf. The browser runs on Bright Data’s servers, meaning you don’t need expensive in-house infrastructure to scrape data for your large-scale projects.

Features of Bright Data Scraping Browser

  • Automatic website unlocking: You don’t have to keep refreshing your browser, as Scraping Browser adjusts automatically to handle CAPTCHA solving, new blocks, fingerprinting, and retries. It mimics a real user. 
  • A large proxy network: You can target any country you want, as Scraping Browser has over 72 million IPs. You can target cities or even carriers and benefit from best-in-class technology. 
  • Scalable: You can open thousands of sessions simultaneously, as this browser uses the Bright Data infrastructure to handle all the requests.
  • Puppeteer and Playwright compatible: This browser allows you to make API calls and fetch any number of browser sessions using either Puppeteer (Node.js) or Playwright (Node.js or Python). 
  • Saves time and resources: Instead of you setting up proxies and in-house infrastructure, Scraping Browser takes care of everything in the background.

How to Set up Scraping Browser

  • Head over to the Bright Data website and click on Scraping Browser under the “Scraping Solutions” tab. 
  • Create an account. You will see two options: “Start free trial” and “Start free with Google”. You can either create the account manually or use your Google account. Let us pick “Start free trial” for now and move to the next step. 
  • When your account is created, the dashboard will present several options. Select “Proxies & Scraping Infrastructure”. 
  • In the new window that opens up, select Scraping Browser and click on “Get started”.
  • Save and activate your configurations. 
  • Activate your free trial. The first option gives you a $5 credit that you can use toward your proxy usage. Click on the first option to try this product out. However, if you are a heavy user, you can click on the second option that gives you $50 for free if you load your account with $50 or more. 
  • Enter your billing information. Don’t worry, as the platform will not charge you anything. The billing information just verifies that you are a new user and not looking for freebies by creating multiple accounts. 
  • Create a new proxy. Once you have saved your billing details, you can create a new proxy. Click the “add” icon and select Scraping Browser as your “Proxy type”. Click on “Add Proxy” and move to the next step.
  • Create a new “zone”. A pop-up will appear asking if you want to create a new Zone; click “Yes” and continue. 
  • Click on “Check out code and integration examples”. You will now see proxy integration examples, in either Node.js or Python, that you can use to scrape data from your target website. 

How to Extract Data from a Website    

You now have everything you need to extract data from a website. We shall use our website, geekflare.com, to demonstrate how Scraping Browser works. For this demonstration, we will use Node.js. You can follow along if you have Node.js installed. 

Follow these steps:

  1. Create a new project on your local machine. Navigate into the folder and create a file named script.js. We will run the scraping code locally and display the results in the terminal. 
  2. Open the project in your favorite code editor. I am using VS Code. 
  3. Install Puppeteer using this command: npm i puppeteer-core
  4. Add this code to the script.js file:
const puppeteer = require('puppeteer-core');

// should look like 'brd-customer-<ACCOUNT ID>-zone-<ZONE NAME>:<PASSWORD>'
const auth = 'USERNAME:PASSWORD';

async function run() {
  let browser;
  try {
    browser = await puppeteer.connect({ browserWSEndpoint: `wss://${auth}@zproxy.lum-superproxy.io:9222` });
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000);
    await page.goto('https://example.com');
    const html = await page.evaluate(() => document.documentElement.outerHTML);
    console.log(html);
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
}

if (require.main === module) {
  run();
}
  5. Replace USERNAME:PASSWORD in const auth='USERNAME:PASSWORD'; with your account details. You will find your Username, Zone name, and Password in the tab labeled “Access parameters”. 
  6. Input your target URL. In my case, I want to extract data for all the authors on geekflare.com, found at https://geekflare.com/authors
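The access parameters from step 5 slot into a fixed pattern. As a minimal sketch of how the auth string and WebSocket endpoint fit together (the account ID, zone name, and password below are hypothetical placeholders, not real credentials):

```javascript
// Assemble the Bright Data auth string from its three parts.
function buildAuth(accountId, zoneName, password) {
  return `brd-customer-${accountId}-zone-${zoneName}:${password}`;
}

// Embed the auth string in the WebSocket endpoint passed to puppeteer.connect().
function buildEndpoint(auth) {
  return `wss://${auth}@zproxy.lum-superproxy.io:9222`;
}

const auth = buildAuth('hl_12345678', 'zone1', 'mypassword');
console.log(buildEndpoint(auth));
// → wss://brd-customer-hl_12345678-zone-zone1:mypassword@zproxy.lum-superproxy.io:9222
```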

I will change the page.goto line in my code as follows:

await page.goto('https://geekflare.com/authors/');

My final code will now be:

const puppeteer = require('puppeteer-core');

// should look like 'brd-customer-<ACCOUNT ID>-zone-<ZONE NAME>:<PASSWORD>'
const auth = 'brd-customer-hl_bc09fed0-zone-zone2:ug9e03kjkw2c';

async function run() {
  let browser;
  try {
    browser = await puppeteer.connect({ browserWSEndpoint: `wss://${auth}@zproxy.lum-superproxy.io:9222` });
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000);
    await page.goto('https://geekflare.com/authors/');
    const html = await page.evaluate(() => document.documentElement.outerHTML);
    console.log(html);
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
}

if (require.main === module) {
  run();
}
  7. Run your code using this command:
node script.js

You will have something like this in your terminal:

How to Export the Data 

You can use several approaches to export the data, depending on how you intend to use it. Today, we will export the data to an HTML file by changing the script to create a new file named data.html instead of printing the output to the console. 

You can change your code as follows:

const puppeteer = require('puppeteer-core');
const fs = require('fs');

// should look like 'brd-customer-<ACCOUNT ID>-zone-<ZONE NAME>:<PASSWORD>'
const auth = 'brd-customer-hl_bc09fed0-zone-zone2:ug9e03kjkw2c';

async function run() {
  let browser;
  try {
    browser = await puppeteer.connect({ browserWSEndpoint: `wss://${auth}@zproxy.lum-superproxy.io:9222` });
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000);
    await page.goto('https://geekflare.com/authors/');
    const html = await page.evaluate(() => document.documentElement.outerHTML);
    // Write the HTML content to a file
    fs.writeFileSync('data.html', html);
    console.log('Data export complete.');
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
}

if (require.main === module) {
  run();
}

You can now run the code using this command:

node script.js

When the script finishes, the terminal displays the message “Data export complete.” 


If we check our project folder, we can now see a file named data.html with thousands of lines of code. 


What Can You Extract Using Scraping Browser?

I have just scratched the surface of how to extract data using Scraping Browser. I could even narrow down and scrape only the authors’ names and their descriptions using this tool. 

If you want to use the Scraping Browser, identify the datasets you want to extract and modify the code accordingly. You can extract text, images, videos, metadata, and links, depending on the website you are targeting and the structure of the HTML file. 
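For example, to narrow down to just names and links, you could post-process the html string the script already collects. In a real project you would typically use DOM selectors inside page.evaluate() instead, but as a self-contained sketch, here is a simplified regex-based extractor (the pattern assumes plain anchor tags and is an illustration, not a robust HTML parser):

```javascript
// Pull anchor text and hrefs out of an HTML string, such as the `html`
// variable collected by the script above. Simplified sketch only; prefer
// DOM selectors inside page.evaluate() for production scraping.
function extractLinks(html) {
  const matches = [...html.matchAll(/<a[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>/g)];
  return matches.map((m) => ({ href: m[1], text: m[2].trim() }));
}

const sample = '<p><a href="https://geekflare.com/authors/titus/">Titus Kamunya</a></p>';
console.log(extractLinks(sample));
// → [ { href: 'https://geekflare.com/authors/titus/', text: 'Titus Kamunya' } ]
```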

FAQs

Are Data Extraction and Web Scraping Legal?

Web scraping is a controversial topic, with some arguing it is unethical while others feel it is acceptable. The legality of web scraping depends on the nature of the content being scraped and the policy of the target web page. 
Generally, scraping data that contains personal information, such as addresses and financial details, is considered illegal. Before you scrape data, check whether the site you are targeting has any guidelines. Always ensure that you don’t scrape data that is not publicly available. 

Is Scraping Browser a free tool?

No. Scraping Browser is a paid service. If you sign up for the free trial, the tool gives you a $5 credit. The paid packages start from $15/GB + $0.1/h. You can also opt for the Pay As You Go option, which starts from $20/GB + $0.1/h. 
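As a rough back-of-the-envelope estimate using the rates quoted above ($15 per GB plus $0.1 per hour on the paid packages), a month's bill could be sketched like this (the usage figures are made up, and current rates may differ; check Bright Data's pricing page):

```javascript
// Estimate cost from the quoted rates: a per-GB traffic charge plus a
// per-hour browser charge. Rates and usage here are illustrative only.
function estimateCost(gb, hours, perGb = 15, perHour = 0.1) {
  return gb * perGb + hours * perHour;
}

console.log(estimateCost(2, 10)); // 2 GB and 10 hours → 31
```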

What is the difference between Scraping Browsers and Headless Browsers?

Scraping Browser is a headful browser, meaning that it has a graphical user interface (GUI). On the other hand, headless browsers do not have a graphical interface. Headless browsers such as Selenium are used to automate web scraping but are sometimes limited as they have to deal with CAPTCHAs and bot detection. 

Wrapping Up

As you can see, Scraping Browser simplifies extracting data from web pages. It is simple to use compared to tools such as Selenium. Even non-developers can use this browser, thanks to its friendly user interface and good documentation. The tool has unblocking capabilities unavailable in other scraping tools, making it effective for anyone who wants to automate such processes. 

You may also explore how to stop ChatGPT Plugins from scraping your website content.

  • Titus Kamunya
    Author
    Titus is a Software Engineer and Technical Writer. He develops web apps and writes on SaaS, React, HTML, CSS, JavaScript, Ruby and Ruby on Rails.