Making Your Python Web Scraping Go Live with Pathway

Python web scraping is a powerful technique for automatically extracting information from websites. However, traditional web scraping scripts run as one-off jobs over a static snapshot of the data, so you miss any content published between executions. Fortunately, Pathway allows you to transform your existing web scraping scripts into efficient, real-time data processing pipelines, ensuring you always have up-to-date content.

In this tutorial, you will learn how to integrate your Python web scraper with Pathway so that it regularly fetches fresh content from the websites you care about. We provide a simple scraper for news websites as an example. The goal is to show you how to adapt your web scraper and use it with Pathway for real-time web scraping.

Once ingested in real-time, Pathway allows you to filter, process, and transform the scraped content. You can use LLMs to extract topics, filter content, and notify stakeholders via Slack when new articles are scraped.

You can find the complete source code for this tutorial on GitHub. The repository includes the example web scraper, the Pathway pipeline that uses it, and a ready-to-use Docker configuration to run the project.

Scope of the Tutorial

This tutorial explores a small and simple use case that involves fetching news articles, then dumping them into a jsonl file. The application will continuously fetch articles from a list of websites.

This tutorial is divided into two parts:

  • The constraints your Python web scraping script should follow in order to use it with Pathway.
  • How to build a Pathway connector that uses the scraping script and dumps the results into a jsonl file.

Important Note about Python Web Scraping

Web scraping is difficult, and scraping many websites over long periods of time exposes you to issues that are unrelated to Pathway. Some of these are:

  • Repeatedly accessing a website from the same IP address may cause timeouts or IP blocking. In enterprise use, you may need to use proxies.
  • Many sites implement anti-bot measures like CAPTCHA, JavaScript rendering requirements, or request rate limiting.
  • Some websites may be unparsable. While newspaper and news-please work well, they can fail on certain sites. In such cases, use Python HTML libraries to build a custom scraper, as sketched below.

This scraper is provided as a simple example for demonstration. In practice, your implementation of the web scraper should be custom-tailored for your use case.
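
If newspaper or news-please fails on a particular site, you can fall back to a custom extractor. Below is a minimal sketch using requests and BeautifulSoup; the headline and body selectors are assumptions you would adapt to the markup of the target website.

import requests
from bs4 import BeautifulSoup

def scrape_article_fallback(url: str) -> dict:
    # Fetch the page; a browser-like User-Agent reduces the chance of being blocked outright
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Assumption: the headline is the first <h1> and the body is made of <p> tags.
    # Adjust these selectors to match the actual structure of the website you scrape.
    title = soup.h1.get_text(strip=True) if soup.h1 else ""
    text = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

    return {"url": url, "title": title, "text": text}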

Python Web Scraping for Pathway

The dynamic nature of real-time scraping means your Python web scraping script cannot simply return a list of scraped content, as this content should be updated frequently. The scraping should function as a generator, periodically checking for new content.

Moreover, unlike static scraping, real-time scraping necessitates a system to track already-scraped data, preventing redundant processing and ensuring that new content is efficiently captured without starting from scratch. The scraper consists of two main parts:

  • Generating a list of articles
  • Scraping each article's content and metadata

News Scraper Example

We provide a Python web scraper for news here.

Our scraper first generates a list of URLs to be parsed and removes the ones that were already parsed. Then, it parses each remaining URL and updates the set of visited articles.


This script relies on a list_articles function which returns a list of article URLs for a given website.
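
The example repository implements list_articles for news sites; the sketch below shows what such a function could look like, assuming article links can be collected from anchor tags on the front page. The "/news/articles/" filter is a placeholder heuristic that you would adapt per website.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def list_articles(website_url: str) -> list[str]:
    # Download the front page of the website
    response = requests.get(website_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Collect all links and keep the ones that look like article pages.
    # The "/news/articles/" pattern is an assumption: adapt it to each website you scrape.
    article_urls = set()
    for anchor in soup.find_all("a", href=True):
        absolute_url = urljoin(website_url, anchor["href"])
        if "/news/articles/" in absolute_url:
            article_urls.add(absolute_url)

    return list(article_urls)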

Simple Web Scraping with Python

A function called scrape_articles does the real-time web scraping: it continuously monitors the specified websites for new articles. Implemented as a generator, it iterates over each website with a for loop, tracks already-scraped articles in a set to avoid duplicates, and yields the URL, text, and metadata of each new article. The whole process is wrapped in a while True loop that pauses for the specified refresh interval at the end of each iteration, so new content is scraped at regular intervals.

Here is a simplified version of scrape_articles:

import time
from typing import Generator

def scrape_articles(
    website_urls: list[str],
    refresh_interval: int
) -> Generator:

    indexed_articles: set[str] = set() # set of already scraped articles

    while True: # the content will be scraped until the script is stopped
        for website in website_urls:
            articles = list_articles(website) # obtain the article URLs from the website
            for article in articles:
                if article in indexed_articles: # ignore already scraped articles
                    continue

                ... # compute the URL, the text, and the metadata

                indexed_articles.add(article) # add the article to the set to avoid scraping it again
                yield {"url": url, "text": text, "metadata": dict(metadata)} # yield the scraped content

        time.sleep(refresh_interval) # restart the scraping after the refresh interval

This example is only a suggestion: adapt your own scraping script so that it works as a generator and keeps track of the already scraped articles.

Ingesting The Live Scraping in Pathway

Now, we can create a custom Pathway Python connector to ingest the articles generated from the real-time generator.

This is done by implementing a class that inherits from pw.io.python.ConnectorSubject. Here, the connector scrapes the websites and creates a Pathway table with the content, using the generator above to fetch the articles. This time, the Pathway engine runs the connector.

There are many approaches to building a scraper across different domains. This is just a simple example for demonstration purposes—you can customize both the connector and the scraper to fit your specific needs.

import pathway as pw
from pathway.io.python import ConnectorSubject

class NewsScraperSubject(ConnectorSubject):
    _website_urls: list[str]
    _refresh_interval: int

    def __init__(self, website_urls: list[str], refresh_interval: int) -> None:
        super().__init__()
        self._website_urls = website_urls
        self._refresh_interval = refresh_interval

    def run(self) -> None:
        for article in scrape_articles(
            self._website_urls,
            refresh_interval=self._refresh_interval,
        ):
            url = article["url"]
            text = article["text"]
            metadata = article["metadata"]

            # give the keyword arguments according to the pw.Schema we will define
            # in each iteration, the new article will be appended to the table
            self.next(url=url, text=text, metadata=metadata)

You can now utilize the previously defined subject to create an input table using pw.io.python.read. As the scraper fetches new content, rows will be appended to the table.

Running the Scraper

You should set the list of websites to scrape and the refresh interval.

Here, let's fetch the content of BBC News:

websites = ["https://www.bbc.com/"]

# url should be unique
class ConnectorSchema(pw.Schema):
    url: str = pw.column_definition(primary_key=True)
    text: str
    metadata: dict

subject = NewsScraperSubject(
    website_urls=websites,
    refresh_interval=25,  # if you are encountering API throttling, increase this value, default is 600
)

web_articles = pw.io.python.read(subject, schema=ConnectorSchema)

Now, dump the articles into a jsonl file using pw.io.jsonlines.write:

pw.io.jsonlines.write(web_articles, "scraped_web_articles.jsonl")

Run the Pathway pipeline with pw.run:

pw.run()

Alternatively, you can disable the monitoring to see the logs more clearly:

pw.run(monitoring_level=pw.MonitoringLevel.NONE)

If you peek inside scraped_web_articles.jsonl, you will see something like this:

{"_metadata":{"authors":["Patrick Barlow"],"date_download":"2025-03-06 11:53:07","date_modify":"None","date_publish":"2025-03-06 10:10:15","description":"Bedgebury National Pinetum is celebrating 100 years of conservation work in 2025.","filename":"https%3A%2F%2Fwww.bbc.com%2Fnews%2Farticles%2Fcm2n07g4ee1o.json","language":"en","source_domain":"www.bbc.com","title":"Bedgebury Pinetum celebrates 100 years with forest trail","title_page":null,"title_rss":null,"url":"https://www.bbc.com/news/articles/cm2n07g4ee1o"},"text":"A Kent arboretum holding the world's largest collection of conifers is celebrating its centenary with the launch of a new trail.\nBedgebury National Pinetum near Goudhurst holds around 12,000 trees and will celebrate 100 years of conservation works having first opened in 1925.\nThe arboretum has launched a new walking trail through the site which will share the stories behind Bedgebury and the people who have contributed to its success.\nThe trail marks the first of a slate of centenary events including the ceremonial planting of a Japanese hemlock tree to commemorate the centenary on 19 March.\nWinding through the forest, the 1.6 mile (2.57km) trail will include stops to learn about the history of the country estate.\nThe collection began when William Dallimore, Bedgebury's first curator, planted a Japanese hemlock tree in the grounds.\nThe same species of tree, grown from seeds collected in the wild in Japan, will be planted on 19 March to mirror the planting 100 years on.\nJonathan Codd, manager of Bedgebury for Forestry England who look after the site, said: \"This year's special events offer unique opportunities to discover the fascinating stories behind our trees and understand how our research today is helping create climate-resilient forests for future generations.\"\nBedgebury welcomes over 500,000 visitors each year and grows more than 2,000 trees and shrubs from seedlings annually to be planted on the estate or sent to other botanic gardens in the UK and Europe.","diff":1,"time":1741258391274}

Conclusion

Scraping the web in real-time is a challenging task. Pathway helps you by providing a framework to build powerful real-time data processing pipelines.

In this tutorial, you learned how to build a Pathway connector that scrapes the web to fetch content from websites and dumps it into a jsonl file. The provided scraper is a simple example that can be adapted to your specific needs. Pathway also allows you to filter, process, and transform the scraped content in real time, as the data arrives. This lets you build powerful applications that can also be supplemented with LLMs or ML models.
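
For instance, here is a minimal sketch of such downstream processing on the web_articles table defined above; the length computation is a stand-in for any user-defined transformation, such as an LLM call that extracts topics.

# keep only articles with a non-empty body
non_empty_articles = web_articles.filter(pw.this.text != "")

# derive new columns with a user-defined function applied to each row
enriched_articles = non_empty_articles.select(
    pw.this.url,
    text_length=pw.apply(len, pw.this.text),
)

# write the transformed table, updated in real time, to another jsonl file
pw.io.jsonlines.write(enriched_articles, "enriched_web_articles.jsonl")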

For example, you can send the scraped articles to a messaging service such as Slack, as described here. A variety of connectors enables powerful workflows; you can find the full list of connectors here.

You can also check out the subscribe guide to see how you can extend this with custom logic, triggering operations on each row addition.
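
As a quick illustration, here is a minimal sketch that uses pw.io.subscribe to react to every new article; the callback follows the on_change signature described in the subscribe guide.

def on_change(key, row: dict, time: int, is_addition: bool) -> None:
    # called by the Pathway engine for every change to the table
    if is_addition:
        print(f"New article scraped: {row['url']}")

pw.io.subscribe(web_articles, on_change)

# as with the other outputs, the subscription only takes effect once pw.run() is called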

If you are curious about the table operations, check out the table API documentation.