We’ve all been in a situation where we had to extract data from a website at some point.
When working on a new account or campaign, you might not have the data or the information available for the creation of the ads, for example.
In an ideal world, we would have been provided with all of the content, landing pages, and relevant information we need, in an easy-to-import format such as a CSV, Excel spreadsheet, or Google Sheet. (Or at the very least, provided what we need as tabbed data that can be imported into one of the aforementioned formats.)
But that’s not always the way it goes.
Those lacking the tools for web scraping — or the coding knowledge to use something like Python to help with the task — may have had to resort to the tedious job of manually copying and pasting possibly hundreds or thousands of entries.
In a recent job, my team was asked to:
- Go to the client’s website.
- Look through more than 150 new products spread across 15 different pages.
- Copy and paste the product name and landing page URL for each product into a spreadsheet.
Now, you can imagine how lengthy the task would have been if we’d executed it manually.
Not only is it time-consuming, but with someone manually going through that many items and pages and physically having to copy and paste the data product by product, the chances of making a mistake or two are quite high.
It would then require even more time to review the document and make sure it was error-free.
There has to be a better way.
Good news: There is! Let me show you how we did it.
What is IMPORTXML?
Enter Google Sheets. I’d like you to meet the IMPORTXML function.
According to Google’s support page, IMPORTXML “imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.”
Essentially, IMPORTXML is a function that allows you to scrape structured data from webpages, with no coding knowledge required.
For example, it’s quick and easy to extract data such as page titles, descriptions, or links, but also more complex information.
How Can IMPORTXML Help Scrape Elements of a Webpage?
The function itself is pretty simple and only requires two values:
- The URL of the webpage we intend to extract or scrape the information from.
- And the XPath of the element in which the data is contained.
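Put together, the general syntax looks like this (both arguments can be literal strings or cell references):

```
=IMPORTXML(url, xpath_query)
```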
XPath stands for XML Path Language and can be used to navigate through elements and attributes in an XML document.
For example, to extract the page title from https://en.wikipedia.org/wiki/Moon_landing, we would use:
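With the page title XPath //title, the formula looks like this:

```
=IMPORTXML("https://en.wikipedia.org/wiki/Moon_landing", "//title")
```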
This will return the value: Moon landing – Wikipedia.
Or, if we are looking for the page description, try this:
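Here we use single quotes inside the XPath so they don’t clash with the double quotes that wrap the formula’s arguments:

```
=IMPORTXML("https://en.wikipedia.org/wiki/Moon_landing", "//meta[@name='description']/@content")
```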
Here is a shortlist of some of the most common and useful XPath queries:
- Page title: //title
- Page meta description: //meta[@name='description']/@content
- Page H1: //h1
- Page links: //@href
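Any of these can be combined with a cell reference. For instance, assuming cell B1 holds a page URL, pulling every link from that page would look like this:

```
=IMPORTXML(B1, "//@href")
```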
See IMPORTXML in Action
Since we discovered IMPORTXML in Google Sheets, it has truly become one of our secret weapons for automating many of our daily tasks, from campaign and ad creation to content research, and more.
Moreover, the function combined with other formulas and add-ons can be used for more advanced tasks that otherwise would require sophisticated solutions and development, such as tools built in Python.
But in this instance, we will look at IMPORTXML in its most basic form: scraping data from a web page.
Let’s have a look at a practical example.
Imagine that we’ve been asked to create a campaign for .
They would like us to advertise the last 30 articles that have been published under the PPC section of the website.
A pretty simple task, you might say.
Unfortunately, the editors are not able to send us the data and have kindly asked us to refer to the website to source the information required to set up the campaign.
As mentioned at the beginning of our article, one way to do this would be to open two browser windows — one with the website, and the other with Google Sheets or Excel. We would then start copying and pasting the information over, article by article, and link by link.
But using IMPORTXML in Google Sheets, we can achieve the same output with little to no risk of making mistakes, in a fraction of the time.
Step 1: Start with a Fresh Google Sheet
First, we open a new, blank Google Sheets document:
Step 2: Add the Content You Need to Scrape
Add the URL of the page (or pages) we want to scrape the information from.
In our case, we start with /?s=category/pay-per-click/:
Step 3: Find the XPath
Next, we find the XPath of the element whose content we want to import into our spreadsheet.
In our example, let’s start with the titles of the latest 30 articles.
In Chrome, hover over the title of one of the articles, then right-click and select Inspect.
This will open the Chrome Dev Tools window:
Make sure that the article title is still selected and highlighted, then right-click again and choose Copy > Copy XPath.
Step 4: Extract the Data Into Google Sheets
Back in your Google Sheets document, introduce the IMPORTXML function as follows:
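The exact XPath depends on the page’s markup, but with the URL stored in cell B1 and the XPath copied from Chrome (with double quotes changed to single quotes, as noted below), the formula follows this pattern — the element ID title_1 here is illustrative:

```
=IMPORTXML(B1, "//*[@id='title_1']")
```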
A couple of things to note:
First, in our formula, we have replaced the URL of the page with the reference to the cell where the URL is stored (B1).
Second, the XPath copied from Chrome will be enclosed in double quotes. Because the formula itself uses double quotes around its arguments, any double quotes inside the XPath need to be changed to single quotes so they don’t break the formula.
Note that in this instance, because the element ID changes for each article title (title_1, title_2, etc.), we must slightly modify the query and use starts-with in order to capture all elements on the page with an ID beginning with ‘title.’
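With that starts-with modification, the formula becomes (again assuming the URL sits in cell B1 and the IDs begin with ‘title’, as described above):

```
=IMPORTXML(B1, "//*[starts-with(@id, 'title')]")
```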
Here is what that looks like in the Google Sheets document:
And in just a few moments, this is what the results look like after the query has loaded the data into the spreadsheet:
As you can see, the list returns all articles that are featured on the page that we have just scraped (including my previous piece about automation and how to use Ad Customizers to Improve Google Ads campaign performance).
You can apply this to scrape any other piece of information you need to set up your ad campaign, as well.
Let’s add the landing page URLs, the featured snippet of each article, and the name of the author into our Sheets document.
For the landing page URLs, we need to tweak the query to specify that we are after the HREF element attached to the article title.
Therefore, our query will look like this:
Now, append ‘/@href’ to the end of the XPath.
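Assuming the same starts-with query as before, the landing page formula would be:

```
=IMPORTXML(B1, "//*[starts-with(@id, 'title')]/@href")
```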
Voila! Straight away, we have the URLs of the landing pages:
You can do the same for the featured snippets and author names:
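The right XPath for these depends entirely on the site’s HTML. As purely hypothetical examples, if each featured snippet sat in a p element with class ‘excerpt’ and each author name in a link with rel=‘author’, the formulas might look like this:

```
=IMPORTXML(B1, "//p[@class='excerpt']")
=IMPORTXML(B1, "//a[@rel='author']")
```

Inspect the actual elements in Chrome DevTools, as in Step 3, to find the real attributes to query.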
One thing to watch out for: in order for the spreadsheet to fully expand and fill in all the data returned by the query, the column into which the data is populated must have enough free cells and no other data in the way.
This works in a similar way to an ARRAYFORMULA: for the formula to expand, there must be no other data in the same column.
And there you have it: a fully automated, error-free way to scrape data from (potentially) any webpage, whether you need content and product descriptions or ecommerce data such as product prices or shipping costs.
In a time when information and data can be the advantage required to deliver better-than-average results, the ability to scrape webpages and structured content quickly and easily can be priceless. And as we have seen above, IMPORTXML can help cut execution times and reduce the chance of mistakes.
Additionally, the function is not just a great tool that can be exclusively used for PPC tasks, but instead can be really useful across many different projects that require web scraping, including SEO and content tasks.
All screenshots taken by author, August 2021