Web scraping is the magical act of extracting information from a web page. You can do it on one page or millions of pages. There are multiple reasons why scraping is essential in SEO:
- We might use it for auditing a website
- We might need it in the context of programmatic SEO
- We could use it for providing context to our web analytics
Here at WordLift, we primarily focus on structured data and improving the data quality of content knowledge graphs. We depend on crawling to cope with missing and messy data in various use cases.
This blog post will introduce you to a new web scraping Python library developed by Alireza Mika called AutoScraper that will make your web scraping fast, simple, and fun. All the credit goes to him for bringing innovation to a sector that isn’t evolving as fast as you might think.
If you are interested in using the library in Python, I suggest you read Ali’s blog post on Medium.
I found this tool very powerful, yet limited to only some use cases, so I decided to build a simple Streamlit web application that you can use immediately.
Jump to the web application here 🎈
Here is how the scraping app works
- You provide the URL of the web page used as a template. I am using a product page on our E-commerce demo site as a reference.
- You add a list of information (comma separated) that you expect to scrape from that page. Here you can add anything, a snippet of text, the URL of an image, or the structured data property present in the markup. I am adding the title, the price, and the SKU in this example.
- You finally hit “Train” and let AutoScraper learn to extract these attributes from similar pages.
You can choose to let AutoScraper run under the assumption that all pages will be the same (choose “exact”) or that they will have a similar structure (choose “similar” instead).
- You can now add a list of pages that you would like to scrape. I have added two samples here. Keep in mind that there is a limit to the total number of characters that you can add (and therefore to the total number of URLs that you can scrape). This is a demonstration tool and shall be used only for a limited set of pages.
- Voilà, the work is done, and you can now download a CSV containing the price, SKU, and product name for each URL.
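The steps above can be sketched in Python. This is a minimal sketch, not the app's actual code: it assumes `scraper` is an `AutoScraper` instance from the `autoscraper` package, and that the app's “exact” and “similar” options map to the library's `get_result_exact` and `get_result_similar` methods. The helper names `parse_wanted_list`, `train_and_scrape`, and `rows_to_csv` are hypothetical.

```python
import csv
import io


def parse_wanted_list(raw):
    """Split the comma-separated form field into clean sample values."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def train_and_scrape(scraper, template_url, wanted_list, target_urls, mode="exact"):
    """Learn rules from one template page, then apply them to other pages.

    `scraper` is assumed to be an autoscraper.AutoScraper instance
    (pip install autoscraper); it is passed in so the sketch stays testable.
    """
    # "Train": infer extraction rules from the template page and sample values.
    scraper.build(template_url, wanted_list=wanted_list)

    rows = []
    for url in target_urls:
        if mode == "exact":
            # Assumption: target pages share the template's exact layout.
            values = scraper.get_result_exact(url)
        else:
            # Assumption: target pages are only structurally similar.
            values = scraper.get_result_similar(url)
        rows.append([url, *values])
    return rows


def rows_to_csv(rows):
    """Serialize one row per URL into CSV text, ready for download."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

To try it against real pages, you would pass `AutoScraper()` as `scraper`, the template product page as `template_url`, and the sample title, price, and SKU as `wanted_list`. Note that splitting on commas means a sample value that itself contains a comma would be broken into two values.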
How to refine the results
In some cases, we might get false positives; in other words, AutoScraper might extract data that we don’t need. In these cases, we’ll need to revise the set of rules that have been identified and keep just what we need. Let’s review an example.
- If we add the URL of the image behind the reference product to the list of attributes that we want to extract, we will get a table with an unneeded column (column 4).
- We can now refine the rules by clicking on the “Refine Results” button. Here we can see that if we remove rule_zk7p and hit “Crawl” again, we now have the correct table without column 4.
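In the library itself, this kind of refinement can be approximated as follows. This is a hedged sketch: it assumes `scraper` is an already trained `AutoScraper` instance, and that the app's “Refine Results” step corresponds to the library's `remove_rules` method combined with grouped results. The function names `inspect_rules` and `refine` are hypothetical.

```python
def inspect_rules(scraper, url):
    """Return scraped values grouped by rule ID, to see which rule yields what."""
    # grouped=True makes AutoScraper return a dict of {rule_id: [values, ...]}.
    return scraper.get_result_similar(url, grouped=True)


def refine(scraper, url, unwanted_rules):
    """Drop the rules that produce unneeded data, then scrape the page again."""
    # For example, unwanted_rules=["rule_zk7p"] removes the extra column.
    scraper.remove_rules(unwanted_rules)
    return scraper.get_result_similar(url, grouped=True)
```

Calling `inspect_rules` first shows which rule ID is responsible for the unwanted values; passing that ID to `refine` returns the results without it. Rule IDs are generated during training, so they will differ from one trained scraper to the next.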
This is a demonstration web app. The UI is a bit clunky when you start refining rules, and in general, it is limited to crawling only a few URLs. If you are looking for something that scales, I would recommend Advertools, a well-known Python library developed by the mythical Elias Dabbas.
If you want to see how you can use it, watch this webinar. Here, Elias Dabbas and Doreid Haddad show how to build a Knowledge Graph using Advertools and WordLift.
Is web scraping illegal?
No. Web scraping is generally legal, and commercial search engines depend on it. However, there are some considerations to be made:
- Some websites might have terms and conditions that do not allow scraping;
- Technically speaking, scraping is a task that consumes a significant amount of bandwidth and computational resources. We should do it only when it is needed. Google itself is reviewing its indexing policies to be more environmentally friendly; we should do the same.
- How we use the extracted data makes a huge difference. We want to be respectful of others’ content and aware of potential copyright infringements.
You can find more useful information on this topic here.
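One way to act on these considerations in code is to check a site's robots.txt before scraping and to pause between requests. The post does not mention robots.txt, so treat this as an assumption about what respectful scraping could look like rather than a legal safeguard. The sketch uses only the Python standard library; the helper names, the `demo-scraper` user agent, and the one-second default delay are hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="demo-scraper"):
    """Keep only the URLs that the given robots.txt text permits us to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_fetch(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between requests to limit load."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

Here `fetch` is any function that downloads a single page, so the same throttling can wrap whichever scraping call you use. A robots.txt file does not replace reading a site's terms and conditions.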