
How to Extract Website Meta Data Using Geekflare Meta Scraping API

In general, web scraping means extracting data from a website by reading the HTML produced when a webpage is loaded.

Metascraping is extracting the webpage’s metadata from the meta tags of a webpage.

The metadata of a webpage is information about the page but not the page’s content. For example, the metadata may include the author’s name, title, and the web page’s description.

It helps users and search engines understand what the page is about. Scraping metadata lets a user collect information about many web pages quickly.

Several approaches can be used to scrape webpages for their metadata, including scraping manually, using a library, or using an API such as the Geekflare Metascraping API.

Many Ways to Kill a Cat

To scrape manually, one can open a webpage, launch Chrome DevTools, and extract the metadata from the Elements tab. However, this manual process is repetitive and tedious when you are dealing with multiple pages. We can automate the task using multiple approaches:

The first approach is to write the code from scratch. In this approach, you make an HTTP request to the website whose metadata you want to extract. Afterward, you can parse the response HTML extracting data from the meta tags using regular expressions or pattern matching. However, this approach is reinventing the wheel as you will spend time rewriting existing code.
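To make the from-scratch approach concrete, here is a minimal sketch in JavaScript of the parsing step only. It assumes you already hold the page's HTML as a string; extractMetaTags is a hypothetical helper name, not part of any library. Regular expressions are fragile on real-world HTML, so treat this as an illustration rather than a robust parser.

```javascript
// Minimal sketch: collect <meta> tags from an HTML string with regular expressions.
// Only simple, well-formed tags with quoted attribute values are handled.
function extractMetaTags(html) {
  const meta = {};
  // Match each <meta ...> tag and capture its attribute text.
  const tagPattern = /<meta\s+([^>]*?)\/?>/gi;
  // Match key="value" or key='value' pairs inside a tag.
  const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;

  for (const [, attrText] of html.matchAll(tagPattern)) {
    const attrs = {};
    for (const [, key, doubleQuoted, singleQuoted] of attrText.matchAll(attrPattern)) {
      attrs[key.toLowerCase()] = doubleQuoted !== undefined ? doubleQuoted : singleQuoted;
    }
    // Standard tags use "name"; Open Graph tags use "property".
    const label = attrs.name || attrs.property;
    if (label && attrs.content !== undefined) {
      meta[label] = attrs.content;
    }
  }
  return meta;
}

const sampleHtml =
  '<head><meta name="description" content="A sample page">' +
  '<meta property="og:title" content="Sample" /></head>';
console.log(extractMetaTags(sampleHtml));
```

Even this small sketch has to deal with quoting styles, self-closing tags, and two different attribute names, which hints at why hand-rolled parsing quickly becomes a maintenance burden.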

The second approach is to use a library in whatever programming language you prefer. This allows you to abstract over the implementation detail and keeps things simple. However, if the programming language of your choice does not have a suitable library or the particular runtime you are using does not support the library, then you cannot use it.

The third approach is to use an API like the Geekflare Metascraping API. This approach is ideal because it gives you a uniform interface regardless of your programming language. It is usable in any language as long as it supports making HTTP requests.

This article will demonstrate how to use the Geekflare Metascraping API with cURL, JavaScript (NodeJS), and PHP.

Why Should You Use the Geekflare Metascraping API?

Compared with the other approaches, the Geekflare API has the following advantages:

  • It is language and runtime environment agnostic.
  • You avoid reinventing the wheel and spend less time writing code.
  • You can scrape multiple websites efficiently (in a matter of seconds).
  • It is incredibly easy to use.
  • You can use it for free.

Getting Started Using the Geekflare API

To use the Geekflare API, you will need an API key. To obtain one, go to the Geekflare Website and create a free account. After creating your account, log in to the dashboard. From the dashboard, you should be able to see your API key.

Geekflare Metascraping API Overview

The API endpoint is located at https://api.geekflare.com/metascraping. When you make a request, you should provide your API key as a request header with the name x-api-key and the value being your API key.

You will also need to pass additional parameters in the request body. These are url, device, and proxyCountry.

  • url specifies the URL of the webpage whose metadata you want to scrape.
  • device specifies the device used to visit the site when scraping metadata. Your options are mobile or desktop.
  • proxyCountry specifies the country from which the request should be made before the data is scraped. It is, however, a premium feature and can only be used under the Geekflare paid plans.

Given that the parameters are passed as part of the body, the request has to be a POST request, since GET requests do not normally carry a request body.
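The request described above can be sketched as a small JavaScript helper that assembles the endpoint, headers, and JSON body. This is a sketch under assumptions: buildMetascrapingRequest is a hypothetical name, device and proxyCountry are assumed to be optional, and the accepted format of proxyCountry values is not stated in this article, so the helper passes it through unchanged.

```javascript
const METASCRAPING_ENDPOINT = 'https://api.geekflare.com/metascraping';

// Build the endpoint and fetch-style options for one metascraping request.
// Assumption: only "url" is required; "device" and "proxyCountry" are optional.
function buildMetascrapingRequest(apiKey, url, { device, proxyCountry } = {}) {
  if (device !== undefined && device !== 'mobile' && device !== 'desktop') {
    throw new Error(`device must be "mobile" or "desktop", got "${device}"`);
  }

  const payload = { url };
  if (device !== undefined) payload.device = device;
  // Premium-only parameter; its value format is not validated here.
  if (proxyCountry !== undefined) payload.proxyCountry = proxyCountry;

  return {
    endpoint: METASCRAPING_ENDPOINT,
    options: {
      method: 'POST', // the parameters travel in the request body
      headers: {
        'Content-Type': 'application/json',
        'x-api-key': apiKey,
      },
      body: JSON.stringify(payload),
    },
  };
}

const example = buildMetascrapingRequest('<API_KEY>', 'https://tesla.com', { device: 'mobile' });
console.log(example.options.body);
```

The returned endpoint and options can be handed to any fetch-compatible function, which keeps request construction separate from the network call.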

Using the Geekflare Metascraping API in cURL

In the first demonstration, we will use the cURL utility from the command line to request the Metascraping API. To use cURL, you will need to install it first.

I am going to be using a Bash terminal. This should be the default terminal on macOS and Linux. For Windows, you will have to install Git Bash.

After cURL is installed, we can use the cURL command to make the request. We will pass in options to the command to specify the request parameters: the request method, the endpoint, the request body, and the request headers.

curl -X POST \
https://api.geekflare.com/metascraping \
-d '{ "url": "https://tesla.com" }' \
-H 'Content-Type: application/json' \
-H 'x-api-key: <API_KEY>'

NB: The backslash at the end of every line except the last allows you to break the command input into multiple lines. Nothing, not even a space, may follow the backslash.

This command specified the HTTP method as POST and the endpoint as the Geekflare API meta-scraping endpoint.

We also sent the request body as a JSON object with a url property set to https://tesla.com. Lastly, we added the headers that specify the body content type as JSON and provided the API key using the x-api-key header.

When we run this command, we get the following output:

{"timestamp":1669328564856,"apiStatus":"success","apiCode":200,"meta":{"url":"https://tesla.com","device":"desktop","test":{"id":"1fh2c30i05vmvxb99pdh6t6hze2x72jv"}},"data":{"author":null,"date":null,"description":"Tesla is accelerating the world’s transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and businesses.","image":"https://tesla-cdn.thron.com/delivery/public/image/tesla/6139697c-9d6a-4579-837e-a9fc5df4a773/bvlatuR/std/1200x628/Model-3-Homepage-Social-LHD","logo":"https://tesla.com/themes/custom/tesla_frontend/assets/favicons/favicon-196x196.png","publisher":"Tesla","title":"Electric Cars, Solar & Clean Energy | Tesla","url":"https://www.tesla.com/","lang":"en"}}

That is the correct output.
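A response shaped like the one above can be unpacked in code. The following is a minimal JavaScript sketch; unwrapMetadata and nonNullFields are hypothetical helper names, and it assumes that an unsuccessful response still carries the apiStatus and apiCode fields, which this article does not show.

```javascript
// Return the "data" object of a metascraping response, or throw if the
// API did not report success.
// Assumption: failed responses also include apiStatus and apiCode.
function unwrapMetadata(response) {
  if (response.apiStatus !== 'success') {
    throw new Error(`Metascraping request failed (apiCode ${response.apiCode})`);
  }
  return response.data;
}

// Keep only the fields that actually have a value (drops null entries).
function nonNullFields(data) {
  return Object.fromEntries(
    Object.entries(data).filter(([, value]) => value !== null)
  );
}

// Abbreviated copy of the response shown above.
const sampleResponse = {
  apiStatus: 'success',
  apiCode: 200,
  data: { author: null, date: null, publisher: 'Tesla', lang: 'en' },
};

console.log(nonNullFields(unwrapMetadata(sampleResponse)));
```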

Using the Geekflare Metascraping API with JavaScript

For this project, we will create a NodeJS script to fetch data from the API. This means you will need NodeJS installed. You will also need NPM or any other package manager for Node to manage the project’s dependencies. I am also going to be using the Bash terminal to run commands.

To use the API in JavaScript, we first create an empty project folder and open it in a terminal.

mkdir metascraping-js && cd metascraping-js

After this, we can create the file where we are going to write the script:

touch index.js

Then we can instantiate the project as a Node project:

npm init -y

To use ESModule syntax inside our file, add the line "type": "module" to the root of the package.json file such that it looks like this:

{
  "name": "metascraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}

Next, we will install the node-fetch package. This package provides a fetch function in NodeJS that is similar to the browser’s fetch function, which makes HTTP requests easier to write than with the built-in http module. NodeJS 18 and later also ship a global fetch function, so the package is mainly needed on older versions.

npm install node-fetch

When the package is correctly installed, we can start editing the script. Open the index.js file using a text editor of your choice. In my case, I am going to be using the terminal-based nano text editor.

nano index.js

Editing the index.js file, we start by importing the fetch function, which is the default export of the node-fetch module.

import fetch from 'node-fetch'

Then, we will define the body of our request. This is going to be a JSON string with a url property. The url property’s value is the webpage whose metadata we want to get.

const body = JSON.stringify({ url: 'https://spacex.com' });

Next, we may define the request options we will pass to the fetch function when we eventually call it.

const options = {
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
        'x-api-key': '<YOUR API KEY>'
    },
    body: body
}

We have defined our request method as being a POST request. We also defined two headers. One specifies that the body contains JSON data, and the other provides the API key.

Replace <YOUR API KEY> with your actual API key. In practice, the API key should not be hard-coded into the file but should be loaded from an environment variable. Lastly, we specified the body property as the value of the body constant we defined earlier.
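As a sketch of the environment-variable approach, the helper below reads the key at startup and fails early when it is missing. The variable name GEEKFLARE_API_KEY and the function name readApiKey are assumptions chosen for this illustration; any name works as long as the script and the shell agree on it.

```javascript
// Read the API key from the environment instead of hard-coding it.
// GEEKFLARE_API_KEY is a hypothetical variable name used for illustration.
function readApiKey(env = process.env) {
  const key = env.GEEKFLARE_API_KEY;
  if (!key) {
    throw new Error('Set the GEEKFLARE_API_KEY environment variable first');
  }
  return key;
}

// The headers can then be built without the key appearing in the source file.
function buildHeaders(env = process.env) {
  return {
    'Content-Type': 'application/json',
    'x-api-key': readApiKey(env),
  };
}
```

With this in place, the key can be supplied when the script is started, for example by prefixing the run command in Bash with GEEKFLARE_API_KEY=your-key.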

Finally, we make the call to fetch:

fetch('https://api.geekflare.com/metascraping', options)
    .then(response => response.json())
    .then(json => console.log(json))

Here, we have called the fetch function, passing in the API endpoint and the options we defined earlier. Since fetch returns a promise, we used then to attach a callback that parses the response body as JSON.

The callback returns another promise, and when it resolves, we are going to console.log() the returned object.

So ultimately, our file should look like this.

import fetch from 'node-fetch'

const body = JSON.stringify({ url: 'https://spacex.com' });

const options = {
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
        'x-api-key': '<YOUR API KEY>'
    },
    body: body
}

fetch('https://api.geekflare.com/metascraping', options)
    .then(response => response.json())
    .then(json => console.log(json))

To run the script, save the edits, and close nano or the text editor you are using, then enter the following command:

node .

You should get the following metadata:

{
  timestamp: 1669305079698,
  apiStatus: 'success',
  apiCode: 200,
  meta: {
    url: 'https://spacex.com',
    device: 'desktop',
    test: { id: '8m3srgqw06q2k8li5p6x70s8165d6e2f' }
  },
  data: {
    author: null,
    date: null,
    description: 'SpaceX designs, manufactures and launches advanced rockets and spacecraft.',
    image: 'https://www.spacex.com/static/images/share.jpg',
    logo: 'https://spacex.com/static/images/favicon.ico',
    publisher: 'SpaceX',
    title: 'SpaceX',
    url: 'http://www.spacex.com/',
    lang: 'en'
  }
}
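The same request can be repeated for several pages at once. The sketch below is one possible way to do that with async/await and Promise.all; scrapeMany is a hypothetical helper, the fetch implementation is passed in as a parameter so the function can be tried without network access, and any rate limits the API may apply are not handled here.

```javascript
// Scrape several URLs concurrently and return results in the input order.
// fetchImpl defaults to the global fetch (available in NodeJS 18+); on older
// versions, pass in the fetch function imported from node-fetch.
async function scrapeMany(urls, apiKey, fetchImpl = fetch) {
  const requests = urls.map(async (url) => {
    const response = await fetchImpl('https://api.geekflare.com/metascraping', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-api-key': apiKey,
      },
      body: JSON.stringify({ url }),
    });
    const json = await response.json();
    return { url, data: json.data };
  });

  // Promise.all rejects as soon as any single request fails.
  return Promise.all(requests);
}
```

A call such as scrapeMany(['https://spacex.com', 'https://tesla.com'], apiKey) resolves to an array with one entry per URL. If one failing page should not abort the whole batch, Promise.allSettled is a possible alternative.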

Using the Geekflare API with PHP

To use the Geekflare Metascraping API with PHP, first ensure you have PHP and Composer installed on your local machine.

To begin, create and open the project folder.

mkdir metascraping-php && cd metascraping-php

Next, install GuzzleHTTP. Guzzle is one of the many PHP clients you can use with the Geekflare API.

composer require guzzlehttp/guzzle

Once Guzzle is installed, we can create a script with

touch script.php

Then we can start writing the code. Using a text editor of your choice, open the script.php file. In my case, I am going to use nano which is a terminal-based text editor.

nano script.php

Inside the script, we insert the boilerplate PHP tags:

<?php
    // All code goes here
?>

Now load Composer's autoloader and import the Request and Client classes from Guzzle. This code should be written between the <?php and ?> we wrote before.

require_once('vendor/autoload.php');

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

Next, we can create a client by instantiating the GuzzleHttp\Client class

$client = new Client();

Afterward, we can define headers for the request. For this particular request, we will provide two headers, one that specifies that the content type of the body is JSON and the other containing our API key.

$headers = [
    'x-api-key' => '<YOUR API KEY HERE>',
    'Content-Type' => 'application/json'
];

Replace <YOUR API KEY HERE> with your actual API key from the Geekflare API dashboard.

Then, we can define the body. In our case, the body is going to be a JSON string with the property url set to "https://twitter.com".

$body = json_encode([
    "url" => "https://twitter.com"
]);

To create a request, we instantiate the request class we imported earlier, passing in the request method, the endpoint, the headers, and the request body.

$request = new Request('POST', 'https://api.geekflare.com/metascraping', $headers, $body);

Next, we use the client to send the request.

$response = $client->sendAsync($request)->wait();

Afterward, we can extract the body of the response and print it to the console:

echo $response->getBody();

If you have copied the code correctly, the script.php file should look like this

<?php
    // Load the classes installed by Composer
    require_once('vendor/autoload.php');

    use GuzzleHttp\Client;
    use GuzzleHttp\Psr7\Request;

    $client = new Client();

    // The API key travels in the x-api-key header
    $headers = [
        'x-api-key' => '<YOUR API KEY>',
        'Content-Type' => 'application/json'
    ];

    // The page to scrape is passed in the JSON body
    $body = json_encode([
        "url" => "https://twitter.com"
    ]);

    $request = new Request('POST', 'https://api.geekflare.com/metascraping', $headers, $body);

    // Send the request and wait for the response
    $response = $client->sendAsync($request)->wait();

    echo $response->getBody();
?>

Save the script, close it, and run it using:

php script.php

You should get the following output:

{
    "timestamp":1669322100912,
    "apiStatus":"success",
    "apiCode":200,
    "meta": {
        "url":"https://twitter.com",
        "device":"desktop",
        "test":{ 
            "id":"wn1nj30r04bk0ijtpprwdqmtuirg9lze"
        }
     },
     "data":{ 
         "author":null,
         "date":null,
         "description":"The latest stories on Twitter - as told by Tweets.",
         "image":"https://abs.twimg.com/a/1602199131/img/moments/moments-card.jpg",
         "logo":"https://abs.twimg.com/responsive-web/client-web/icon-ios.b1fc7279.png",
         "publisher":"Twitter",
         "title":"Explore",
         "url":"https://twitter.com/explore",
         "lang":"en"
     }
}

Final Words

This guide went through different ways to consume the Geekflare Metascraping API.

The Metascraping API also accepts parameters other than url. One such parameter is proxyCountry, which can only be used with the Geekflare API premium plan. Regardless, the Geekflare API remains powerful enough for many uses.

Check out the official documentation of the Geekflare API for more information.
