Playwright Async API (Python): Scraping Google Search Results for Cat Memes

The Playwright async API for Python is a browser-automation library used for end-to-end testing and web scraping. It drives Chromium, Firefox, and WebKit on Windows, macOS, and Linux, and its calls are coroutines that you run with asyncio. The API gives you fine-grained control over the browser and supports a wide range of actions, including:

Navigating to URLs
Filling out forms
Clicking buttons
Interacting with the console
Taking screenshots
Scrolling the page
Hovering over elements
Drag and drop operations
Network interception
Mocking geolocation
And much more

In this blog post, we will use the Playwright async API to search Google for cat memes, take a screenshot of the results page, and collect the result links. We will then visit each link in a loop, extract the page text, clean it up, and store it in a dictionary keyed by URL.

Installation

To install Playwright for Python, run the first command below. The second command downloads the browser binaries that Playwright drives:


pip install playwright
playwright install

Usage

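To use the async API, start Playwright with async_playwright(), launch a browser, and open a page. The browser object represents a single browser instance, and a page represents an individual tab within it. Calling browser.new_page() also creates a fresh browser context behind the scenes, which is an isolated session with its own cookies and storage.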


import asyncio
import re

from playwright.async_api import async_playwright


async def main():
    # Maps each visited link to the cleaned text of its page
    results = {}

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Navigate to Google
        await page.goto("https://www.google.com")

        # Search for cat memes. The selectors used here are assumptions about
        # Google's markup, which changes often; adjust them if they stop matching.
        await page.fill('[name="q"]', "cat memes")
        await page.press('[name="q"]', "Enter")

        # Wait for the search results to load
        await page.wait_for_selector("#search a[href]")

        # Get the search results
        search_results = await page.query_selector_all("#search a[href]")

        # Create a list of links to the search results.
        # get_attribute() is a coroutine, so each call must be awaited.
        links = []
        for result in search_results:
            href = await result.get_attribute("href")
            if href and href.startswith("http") and href not in links:
                links.append(href)

        # Print the list of links
        print(links)

        # Take a screenshot of the search results page
        await page.screenshot(path="search-results.png")

        # Visit each page in the loop
        for link in links:
            # Navigate to the page and wait for it to load
            await page.goto(link)
            await page.wait_for_selector("body")

            # Get the rendered text of the page
            text = await page.inner_text("body")

            # Remove any leftover tag-like markup using a regular expression
            cleaned_text = re.sub(r"<.*?>", "", text)

            # Strip leading and trailing whitespace
            cleaned_text = cleaned_text.strip()

            # Store the text in the dictionary
            results[link] = cleaned_text

        await browser.close()

    return results


if __name__ == "__main__":
    asyncio.run(main())
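
The anchors on a results page often include relative URLs, duplicates, and redirect wrappers, so it can help to filter the collected href values before visiting them. The sketch below shows one possible way to do that with the standard library only. The name normalize_links is a hypothetical helper, and the /url?q=... redirect format is an assumption about how some result links are wrapped, so verify it against the markup you actually receive.

```python
from urllib.parse import parse_qs, urlparse


def normalize_links(hrefs):
    """Return absolute http(s) URLs in first-seen order, without duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:
            continue  # a missing href attribute comes back as None
        parsed = urlparse(href)
        # Assumption: some result links are redirects of the form
        # /url?q=<target>; unwrap them to reach the real target.
        if parsed.path == "/url":
            target = parse_qs(parsed.query).get("q", [""])[0]
            if target:
                href = target
                parsed = urlparse(href)
        if parsed.scheme not in ("http", "https"):
            continue  # skip relative links, javascript:, mailto:, and so on
        if href not in seen:
            seen.add(href)
            links.append(href)
    return links
```

In the script, the raw href values gathered from the result elements could be passed through this function once, and the loop would then iterate over the returned list.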

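The cleaning step in the loop removes tag-like markup and trims the ends of the text, but scraped pages usually also contain HTML entities and long runs of blank lines. The following is a minimal sketch of a fuller cleaner, assuming you want plain text with single spaces and no empty lines. The name clean_text is hypothetical. Like any regex-based approach, it can also delete legitimate text that happens to sit between a "<" and a ">".

```python
import html
import re


def clean_text(raw):
    """Strip tag-like markup, decode entities, and tidy whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)     # drop anything that looks like a tag
    text = html.unescape(text)              # decode entities such as &amp;
    text = text.replace("\xa0", " ")        # non-breaking spaces become spaces
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)  # trim around newlines, drop blank lines
    return text.strip()
```

Inside the loop, the dictionary entry could then be written as results[link] = clean_text(text).
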
Conclusion

The Playwright async API is a versatile library that suits many browser-automation tasks, including web scraping. In this post, we used it to search Google for cat memes, take a screenshot of the results page, and collect the result links. We then visited each link in a loop, extracted the page text, cleaned it, and stored it in a dictionary keyed by URL.
