Playwright Async API (Python): Scraping Google Search Results for Cat Memes

Playwright is a browser automation library used for end-to-end testing and web scraping, and its Python package ships both a synchronous and an asynchronous API. This post uses the asynchronous one. Scripts written with it run against Chromium, Firefox, and WebKit on Windows, macOS, and Linux, and they drive the browser through async/await calls. The API gives you fine-grained control over the browser, including:

  • Navigating to URLs
  • Filling out forms
  • Clicking buttons
  • Interacting with the console
  • Taking screenshots
  • Scrolling the page
  • Hovering over elements
  • Drag and drop operations
  • Network interception
  • Mocking geolocation
  • And much more
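
As a hedged illustration of one item from the list above, network interception, the sketch below aborts requests for heavy resource types so that text-only scraping loads faster. It assumes a page object created with Playwright's async API. BLOCKED_TYPES, handle_route, and block_heavy_resources are names invented for this example, and the set of blocked types is an arbitrary choice.

```python
# Hypothetical choice of resource types to skip while scraping text.
BLOCKED_TYPES = {"image", "media", "font"}


async def handle_route(route):
    """Abort requests for heavy resources and let everything else through."""
    if route.request.resource_type in BLOCKED_TYPES:
        await route.abort()
    else:
        await route.continue_()


async def block_heavy_resources(page):
    """Register the handler for every URL requested by the given page."""
    await page.route("**/*", handle_route)
```

Calling await block_heavy_resources(page) once, before the first page.goto(), is enough for the handler to apply to every later navigation on that page.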

In this post, we use the Playwright async API for Python to search Google for cat memes, take a screenshot of the results page, and collect the result links into a list. We then visit each link in a loop, extract the page text, strip leftover HTML tags and surrounding whitespace, and store the cleaned text in a dictionary keyed by URL.
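
The cleaning and storing step can be sketched with the standard library alone. This is a minimal example, not the only way to do it: clean_text and TAG_RE are names chosen here, the example.com URL is a placeholder, and the regular expression is a heuristic rather than a real HTML parser.

```python
import html
import re

# Heuristic tag matcher; it does not understand comments or scripts.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse runs of whitespace."""
    without_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


# Store cleaned text in a dictionary keyed by URL (placeholder URL).
results = {}
results["https://example.com/cats"] = clean_text(
    "<p>Grumpy &amp; <b>fluffy</b></p>\n"
)
```

Tags are removed before entities are decoded so that an escaped sequence such as &lt;b&gt; survives as literal text instead of being mistaken for markup.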


Installation

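To install Playwright for Python, run the following commands. The first installs the package; the second downloads the browser binaries (Chromium, Firefox, and WebKit) that Playwright drives:


pip install playwright
playwright install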


Usage

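To use the async API, start Playwright with async_playwright(), launch a browser, and open a page. A browser object represents one running browser instance. A browser context is an isolated, incognito-like session inside that browser, and a page is a single tab within a context. Calling browser.new_page(), as the script below does, creates a new context and a page inside it in one step.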
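
To make the relationship between these objects concrete, here is a small sketch that creates a browser context explicitly and opens one page in it. It assumes a browser already returned by p.chromium.launch(). The function name open_isolated_page and the idea of mocking the location at this point are choices made for the example, not part of the original script.

```python
async def open_isolated_page(browser, latitude: float, longitude: float):
    """Create a fresh context with a mocked location and open one tab in it."""
    context = await browser.new_context(
        geolocation={"latitude": latitude, "longitude": longitude},
        permissions=["geolocation"],  # let pages read the mocked location
    )
    page = await context.new_page()
    return context, page
```

Each context keeps its own cookies and storage, so closing it with await context.close() discards that session without shutting down the browser.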


import asyncio
import re

from playwright.async_api import async_playwright


async def main():
    # Maps each visited link to the cleaned text of its page
    results = {}

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Navigate to Google
        await page.goto("https://google.com")

        # Search for cat memes.
        # NOTE: "#search-box", "#search-button", and ".result" are
        # placeholder selectors. Inspect the live page and substitute
        # the selectors it actually uses.
        await page.fill("#search-box", "cat memes")
        await page.click("#search-button")

        # Wait for the search results to load
        await page.wait_for_selector(".result")

        # Get the search results
        search_results = await page.query_selector_all(".result")

        # Create a list of links to the search results.
        # get_attribute() is a coroutine, so each call must be awaited;
        # it returns None when the element has no href.
        links = []
        for result in search_results:
            href = await result.get_attribute("href")
            if href:
                links.append(href)

        # Print the list of links
        print(links)

        # Take a screenshot of the search results page
        await page.screenshot(path="search-results.png")

        # Visit each page in the loop
        for link in links:
            # Navigate to the page
            await page.goto(link)

            # Wait for the page to load
            await page.wait_for_selector("body")

            # Get the rendered text of the page
            text = await page.inner_text("body")

            # Remove any leftover HTML tags using a regular expression
            cleaned_text = re.sub(r"<.*?>", "", text)

            # Strip leading and trailing whitespace
            cleaned_text = cleaned_text.strip()

            # Store the text to a dictionary
            results[link] = cleaned_text

        await browser.close()

    return results


asyncio.run(main())
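
The link-collection and page-visiting steps above can be made more tolerant of real-world pages. The following is a sketch under stated assumptions rather than a definitive implementation: normalize_links and scrape_pages are hypothetical helpers, the 15-second timeout is an arbitrary default, and the broad except clause stands in for catching playwright.async_api.Error in production code.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(hrefs, base_url):
    """Resolve relative hrefs, keep http(s) URLs only, and drop duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:
            continue  # get_attribute() returns None when href is missing
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip javascript:, mailto:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


async def scrape_pages(page, links, timeout_ms=15_000):
    """Visit each link and map it to its body text, or None on failure."""
    results = {}
    for link in links:
        try:
            await page.goto(link, timeout=timeout_ms, wait_until="domcontentloaded")
            results[link] = (await page.inner_text("body")).strip()
        except Exception as exc:  # broad on purpose in this sketch
            print(f"Skipping {link}: {exc}")
            results[link] = None
    return results
```

Recording None for a failed page keeps one slow or broken site from aborting the whole run, and it leaves a visible marker in the dictionary for links that need a retry.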


Conclusion

The Playwright async API for Python is a versatile tool for browser automation, including web scraping. In this post, we used it to search Google for cat memes, take a screenshot of the results page, and collect the result links into a list. We then visited each link in a loop, extracted the page text, stripped leftover HTML tags and surrounding whitespace, and stored the cleaned text in a dictionary keyed by URL.
