Web Scraping for SEO

Web scraping is the magical act of extracting information from a web page. You can do it on one page or millions of pages. There are multiple reasons why scraping is essential in SEO:

  • We might use it for auditing a website
  • We might need it in the context of programmatic SEO 
  • We could use it for providing context to our web analytics

Here at WordLift, we primarily focus on structured data and improving the data quality of content knowledge graphs. We depend on crawling to cope with missing and messy data across a variety of use cases.

Extracting Structured Data from Web Pages using Large Language Models

Recently, I’ve been exploring the potential of OpenAI function calling for extracting structured data from web pages. This could be a game-changer for those who, like us, are actively looking to synergize Large Language Models (LLMs) with Knowledge Graphs (KGs).

Why is this exciting? Because the integration of LLMs with KGs is fast becoming a hot topic in tech, and developing a unified framework that can enrich both LLMs and KGs simultaneously is of significant importance.

By using this Colab Notebook, you can extract entity attributes from a list of URLs – even from pages built with JavaScript! In this implementation I used the schema.org LodgingBusiness type (hotels, B&Bs, and resorts).
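To make the idea concrete, here is a minimal sketch of what a function-calling tool definition for LodgingBusiness could look like. The attribute list and names below are illustrative assumptions, not the exact schema used in the notebook:

```python
# Illustrative tool definition for extracting schema.org LodgingBusiness
# attributes via OpenAI function calling. The properties below are a
# hypothetical subset chosen for the example.
extract_lodging_business = {
    "type": "function",
    "function": {
        "name": "extract_lodging_business",
        "description": "Extract LodgingBusiness attributes from a web page.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Name of the hotel or B&B"},
                "address": {"type": "string", "description": "Street address"},
                "telephone": {"type": "string", "description": "Contact phone number"},
                "priceRange": {"type": "string", "description": "e.g. $$ or 100-200 EUR"},
            },
            "required": ["name"],
        },
    },
}

# With an OpenAI client, the page text would be passed alongside this tool:
# client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": page_text}],
#     tools=[extract_lodging_business],
# )
```

The model then returns its answer as structured arguments matching this JSON schema, which can be validated before loading into the knowledge graph.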

A few lessons learned from this exploration:

  1. We can seamlessly extract data from webpages using LLMs.
  2. It’s wise to continue using existing scraping techniques where possible. For instance, BeautifulSoup is excellent for scraping titles and meta descriptions.
  3. Using LLMs is slow and expensive, so optimizing the process is key.
  4. After extraction, it’s crucial to thoroughly check and validate the data to ensure its accuracy and reliability. Data integrity is paramount!
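On point 2, a few lines of BeautifulSoup are all it takes to grab titles and meta descriptions – no LLM required. A minimal example on a static HTML snippet:

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head>
    <title>Seaside Hotel – Official Site</title>
    <meta name="description" content="A family-run hotel by the sea.">
  </head>
  <body><h1>Welcome</h1></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string.strip()
meta = soup.find("meta", attrs={"name": "description"})
description = meta["content"] if meta else None

print(title)        # Seaside Hotel – Official Site
print(description)  # A family-run hotel by the sea.
```

For live pages you would fetch the HTML first (e.g. with `requests`) and keep the LLM for the attributes that classic selectors cannot reach.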

ScrapeGraphAI – the New Frontier in Web Scraping

I have recently discovered a fantastic new library for AI-powered scraping called ScrapeGraphAI. This Python library uses LLMs and direct graph logic to create scraping pipelines for websites and for any type of document (XML, HTML, JSON, etc.).

This library – at first glance – has proven to be powerful, adapting seamlessly to various web pages, which prompted me to update the Streamlit web application that you can now use right away.

Jump to the web application here [now using ScrapeGraphAI] 🎈

Here is how the scraping app works:

  1. Input your OpenAI API key to enable the AI processing.
  2. Provide the URL of the web page you want to crawl.
  3. Enter your scraping instructions in the form of a user prompt. This could include details like the title, price, or SKU, formatted in a way that guides the AI to understand what data to extract.
  4. Hit “Crawl” and let ScrapeGraphAI analyze the page based on your instructions.
  5. Voilà! The work is done, and you can now download a CSV containing the required attributes for the page.
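The steps above can be sketched with ScrapeGraphAI directly. This assumes `pip install scrapegraphai` and a valid OpenAI key; the model name is an assumption and may differ between library versions:

```python
# Hedged sketch of the app's flow using ScrapeGraphAI's SmartScraperGraph.
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",  # step 1: your OpenAI key
        "model": "openai/gpt-4o-mini",     # assumed model name
    },
    "verbose": False,
}

def crawl_page(url: str, instructions: str) -> dict:
    # Steps 2-4: target URL, scraping instructions, then run the graph.
    from scrapegraphai.graphs import SmartScraperGraph
    graph = SmartScraperGraph(prompt=instructions, source=url, config=graph_config)
    return graph.run()

# Step 5: the returned dict can be flattened into a CSV row,
# e.g. with csv.DictWriter or pandas.DataFrame([result]).to_csv(...).
```

A prompt such as "Extract the title, price, and SKU of each product" is enough; the graph handles fetching, chunking, and prompting the LLM for you.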

Existing limitations

This is a demonstration web app. The UI is a bit clunky when you start refining rules, and in general it is limited to crawling only a few URLs. If you are looking for something that scales, I would recommend Advertools, a well-known Python library developed by the legendary Elias Dabbas.
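For scale, Advertools can crawl a whole site and write one JSON line per page. A hedged sketch, assuming `pip install advertools` and using a placeholder URL and output path:

```python
def crawl_site(start_url: str, output_file: str = "crawl_output.jl") -> None:
    # Assumes advertools (and pandas) are installed; the output file
    # must be a .jl (JSON lines) file.
    import advertools as adv
    import pandas as pd

    # Crawl the site, following internal links.
    adv.crawl(start_url, output_file, follow_links=True)

    # Each crawled page becomes a row with SEO columns such as
    # url, title, and meta_desc.
    df = pd.read_json(output_file, lines=True)
    print(df[["url", "title", "meta_desc"]].head())

# crawl_site("https://example.com")  # network call, run deliberately
```

Because the output is a flat DataFrame, it slots neatly into audits or into building a knowledge graph downstream.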

If you want to see how you can use it, watch this webinar, in which Elias Dabbas and Doreid Haddad show how to build a Knowledge Graph using Advertools and WordLift.

Is web scraping illegal?

No, web scraping is generally legal, which is why commercial search engines exist. However, there are a few considerations to keep in mind:

  1. Some websites have terms and conditions that do not allow scraping.
  2. Technically speaking, scraping consumes a significant amount of bandwidth and computational resources, so we should do it only when needed. Google itself is reviewing its indexing policies to be more environmentally friendly; we should do the same.
  3. How we use the extracted data makes a huge difference. We want to be respectful of others’ content and aware of potential copyright infringements.

You can find more useful information around this topic here.

How can we scrape information? 

Here is the thread for you: