Learn to perform web scraping with Scraper and XPath

You have probably heard the term web scraping before. If not, you may have come across one of its variants: data scraping, data extraction or price scraping. Some of these names refer to the specific applications in which the technique is typically used.

In this post we look at what web scraping consists of, as well as some of its main uses. We also cover a specific Google Chrome extension, Scraper, which helps when carrying out this type of online research.

Bear in mind, however, that the technique remains controversial: in some sectors, and depending on the type of data we want to obtain, it can push against ethical limits. We recommend that you take legal limits and data protection into account when using it, an issue that we discuss later.

Keep reading to discover everything you need to know about this technique!

What is web scraping?

When we talk about web scraping, we refer to a technique that consists of extracting information from a web page in an automated way. It is a form of "data scraping" that gives us the information as it appears in the page's HTML code.

Copying data from a web page and pasting it into a spreadsheet could be considered manual web scraping. With a web scraping tool, we instead program a bot (a crawler) that visits the web page, scrapes the data and then copies it into a database.
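As a minimal sketch of that automated flow, the Python snippet below parses a page, pulls out the items of interest and writes them as spreadsheet-style rows. Everything in it is an assumption made for illustration: the sample markup, the class names and the helper names scrape_products and to_csv are hypothetical, and the page is an inline string rather than something downloaded. The standard-library parser used here also requires well-formed markup, which real HTML often is not.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical page content; a real bot would download this over HTTP.
PAGE = """<html><body>
  <ul>
    <li class="product"><span class="name">Keyboard</span><span class="price">29.90</span></li>
    <li class="product"><span class="name">Mouse</span><span class="price">12.50</span></li>
  </ul>
</body></html>"""


def scrape_products(markup):
    """Return (name, price) tuples for every product item in the markup."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall(".//li[@class='product']"):
        name = item.find("span[@class='name']").text
        price = item.find("span[@class='price']").text
        rows.append((name, float(price)))
    return rows


def to_csv(rows):
    """Serialize the scraped rows as CSV text, the spreadsheet step."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_products(PAGE)
print(to_csv(rows))
```

In a real project the CSV text would be written to a file or loaded into a database instead of being printed.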

The technique is most commonly used to scrape data; however, it is also common to scrape images, videos or any other type of file.

What is scraping used for?

Although many data extractions are still carried out manually today, more and more companies are opting for web scraping techniques when dealing with large-scale data. Here are some of the most common business applications:

Price scraping: Since price plays an important role in the purchase decision, many companies use web scraping to inspect competitors' data. The main objective is to access price information so that the company's own offers later stand out on comparison websites.
Content marketing: Another common practice is content scraping, which gathers information that later feeds into the content marketing strategy.
SEO analysis: It can be useful to scrape the SERPs themselves, for example to obtain a list of the titles ranking for a specific keyword, which can later help us refine our own title tags.
Brand and product protection: The technique is also used to monitor a brand or product, since it can be tracked across all channels to ensure that established regulations are met.
Lead generation: Web data is collected when prospecting for new customers. However, since the GDPR came into force, processes relying on this type of technique have had to change considerably.

Is web scraping legal?

As mentioned earlier, using this technique in certain sectors, or to obtain certain kinds of data, sometimes borders on ethical limits. It is often used to obtain data from other web pages and replicate their content through an API, producing a copy of the information.

In this way, it can amount to unfair competition, infringe copyright, intellectual property rights or registered trademarks, and even breach the GDPR. Furthermore, it can overload the servers of the scraped sites.

For this reason, if you choose to use any web scraping technique, we recommend making sure that you stay within legal limits and the applicable data protection rules.

Learn how to use the Scraper extension with XPath

When it comes to web scraping, we recommend a tool built for data mining. Scraper is a Google Chrome extension that helps you obtain information from a web page and quickly export it as a spreadsheet.

This extension works with the XPath language: from XPath expressions we obtain the collection of data that we can later work on.

How does it work?

To scrape a specific website, we first place ourselves on the element or URL that we want to scrape. Once the extension is installed, right-clicking shows a new option, "Scrape similar". Clicking it opens a window that, depending on the XPath expression in use, lists the matching results, which we can then export and work on. For this example, we scrape the h2 headings of a specific post, so the tool gives us a list of all the h2 headings that make up that post.
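The h2 example can be approximated outside the browser. The following is only a sketch, not how the extension itself is implemented: the sample post and the helper name collect_text are hypothetical, and Python's xml.etree.ElementTree supports just a subset of XPath, so the path is written relative to the root as .//h2 instead of //h2.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a blog post.
POST = """<html><body>
  <h1>Learn web scraping</h1>
  <h2>What is web scraping?</h2>
  <p>Intro text.</p>
  <h2>Is web scraping <em>legal</em>?</h2>
</body></html>"""


def collect_text(markup, path):
    """Return the full text of every element matched by the given path."""
    root = ET.fromstring(markup)
    # itertext() also gathers text nested in child tags such as <em>.
    return ["".join(node.itertext()).strip() for node in root.findall(path)]


headings = collect_text(POST, ".//h2")
for heading in headings:
    print(heading)
```

Changing the path to .//h1 or .//p would list those elements instead, which mirrors editing the expression in the extension's results window.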

If our objective is to scrape the SERP, we start by searching for the specific keyword. We then place ourselves on the element we want to analyze, in this case the titles of the main search results. As before, right-clicking offers the "Scrape similar" option. The tool then shows, from a specific XPath expression, the titles of the main search results, which can be exported to a spreadsheet or copied to the clipboard.

How to use the XPath language?

XPath, also called XML Path Language, is a language for building expressions that traverse and process an XML document. With it we can search for and select elements, always taking into account the hierarchical structure of the XML.

More specifically, the Scraper tool presented above uses this language to interpret the expressions describing the elements that we want to search for and select within the document.

An XML document is processed by building what we call a node tree. The tree starts from a root element and branches into the different elements that hang from it. The root node is identified by /.

At the ends of the tree we find leaf nodes, which have no children: text, comments, processing instructions, or empty elements that contain only attributes. Attributes are identified by a leading @. Attribute nodes are a special case: an element can have as many attributes as needed, and an attribute node is created for each of them.
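To make the node tree more concrete, here is a small sketch that walks a document from its root and prints each element with its attributes, using the @ prefix described above. The sample document and the helper name describe are hypothetical. Note also that Python's xml.etree.ElementTree stores text on the element itself rather than as a separate text node, so this only approximates the XPath model.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<library>
  <book id="b1" lang="en"><title>XPath Basics</title></book>
  <book id="b2"><title>Scraping 101</title></book>
</library>"""


def describe(node, depth=0):
    """Return one indented line per element: tag, @attributes, then text."""
    attrs = " ".join(f"@{key}={value!r}" for key, value in node.attrib.items())
    text = (node.text or "").strip()
    parts = [node.tag]
    if attrs:
        parts.append(attrs)
    if text:
        parts.append(repr(text))
    lines = ["  " * depth + " ".join(parts)]
    for child in node:  # iterating an element yields its direct children
        lines.extend(describe(child, depth + 1))
    return lines


tree_lines = describe(ET.fromstring(DOC))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one step down the hierarchy, so the title elements sit two levels below the root.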

At this point, you may be wondering how to get the concrete XPath expression that selects what you want to scrape.

One option is to inspect the element. Right-clicking on it and choosing "Inspect" opens the "Elements" panel with the element's markup highlighted. Right-clicking that highlighted fragment opens a menu where "Copy" offers the option "Copy XPath". This gives us the concrete expression with which to scrape that element.
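To give an idea of what such a copied expression looks like, the sketch below builds a positional absolute path for every element in a document. It is an approximation for illustration only and not the browser's actual algorithm: the helper name build_paths and the sample markup are hypothetical, and this version always writes an index, even for an only child.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
  <h2>First section</h2>
  <p>Some text.</p>
  <h2>Second section</h2>
</body></html>"""


def build_paths(root):
    """Map each element to a positional path such as /html/body[1]/h2[2]."""
    paths = {root: "/" + root.tag}
    stack = [root]
    while stack:
        parent = stack.pop()
        counts = {}  # how many siblings with each tag have been seen so far
        for child in parent:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[parent]}/{child.tag}[{counts[child.tag]}]"
            stack.append(child)
    return paths


root = ET.fromstring(PAGE)
paths = build_paths(root)
second_h2 = root.findall(".//h2")[1]
print(paths[second_h2])
```

The index in square brackets counts siblings with the same tag, starting at 1, which is how XPath positions work.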

The Scraper tool works in a similar way and saves us this process, since it extracts the XPath expression directly when we place ourselves on the element in question.

Common XPath Expressions for SEO

When scraping, it is important to have established our objective beforehand, since this helps us understand what type of elements we want to scrape. If the final intention is to improve the on-page SEO of our website, the following expressions target the data we are usually interested in.

Title: To obtain the title of a specific page: //title
Meta description: To get the meta description set for a page: //meta[@name='description']/@content
Headers: To see all the headings in a document, distinguishing h1, h2 and h3: //h1, //h2, //h3
Alt of an image: To extract the alt text of the images: //img/@alt
Images without alt attributes: To find the images for which no alternative text has been set: //img[not(@alt)]/@src
Document links: To extract the links in a document: //@href
Anchor text: To extract the anchor text of the links: //a/text()
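The checks in this list can also be scripted. The sketch below gathers the same on-page data from a sample page with Python's standard library. It rests on several assumptions: the page markup and the helper name seo_report are hypothetical, the markup must be well-formed, and xml.etree.ElementTree implements only a subset of XPath. Because that subset has no attribute steps, text() or not(), the elements are selected by path and the attribute or text is then read in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html>
  <head>
    <title>Web scraping guide</title>
    <meta name="description" content="How to scrape with XPath."/>
  </head>
  <body>
    <h1>Web scraping</h1>
    <h2>What is it?</h2>
    <img src="/a.png" alt="Diagram"/>
    <img src="/b.png"/>
    <a href="/contact">Contact us</a>
  </body>
</html>"""


def seo_report(root):
    """Collect the on-page SEO data points from a parsed document."""
    return {
        "title": root.findtext(".//title"),
        "meta_description": [
            meta.get("content")
            for meta in root.findall(".//meta[@name='description']")
        ],
        "headings": {
            tag: [h.text for h in root.iter(tag)] for tag in ("h1", "h2", "h3")
        },
        "img_alt": [img.get("alt") for img in root.findall(".//img[@alt]")],
        # Images lacking an alt attribute are filtered in Python.
        "img_missing_alt": [
            img.get("src")
            for img in root.findall(".//img")
            if img.get("alt") is None
        ],
        "links": [a.get("href") for a in root.findall(".//a[@href]")],
        "anchor_text": [a.text for a in root.findall(".//a")],
    }


report = seo_report(ET.fromstring(PAGE))
for key, value in report.items():
    print(key, value)
```

A library with full XPath support could evaluate the expressions from the list directly; the standard library was chosen here only to keep the sketch free of dependencies.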

Without a doubt, web scraping offers us wide possibilities for extracting information from a web page. However, the main task is to process and interpret the data appropriately. It is important to choose carefully which websites we monitor, as well as the level of depth and our specific objectives.

We are aware of the large amount of content that the Internet offers, which is why any strategy for implementing web scraping techniques depends on the analysis and interpretation of the results. As mentioned earlier, the use of these techniques must pass the filter of legality and respect for the GDPR, so we recommend that you seek professional advice.