Playwright for Python is a browser automation library for end-to-end testing and web scraping. Its Async API lets you write tests and scraping scripts with asyncio that run across multiple browsers (Chromium, Firefox, and WebKit) and platforms (Windows, macOS, and Linux). Playwright gives you fine-grained control over the browser and supports a wide range of actions, including:
- Navigating to URLs
- Filling out forms
- Clicking buttons
- Interacting with the console
- Taking screenshots
- Scrolling the page
- Hovering over elements
- Drag and drop operations
- Network interception
- Mocking geolocation
- And much more
In this blog post, we will use the Playwright Async API to search Google for cat memes, take a screenshot of the results page, and build a list of links to the results. We will then visit each link in a loop, extract the page text, strip any HTML tags and surrounding whitespace, and store the cleaned text in a dictionary keyed by URL.
Installation
To install Playwright for Python, run the following command:
pip install playwright
Then download the browser binaries that Playwright drives:
playwright install
Usage
To use the Async API, you start Playwright with async_playwright(), launch a browser, and open a page. A browser object represents a single browser instance. A browser context is an isolated session within that browser, with its own cookies and storage, and a page represents an individual tab within a context.
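To make the relationship between these objects concrete, here is a minimal sketch that creates each one explicitly. The function name fetch_title and the URL in the usage note are illustrative assumptions, not part of Playwright.

```python
async def fetch_title(url: str) -> str:
    """Open url in a fresh browser context and return the page title."""
    # Imported inside the function so this file can still be imported
    # on a machine where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()    # one browser process
        context = await browser.new_context()  # isolated session: own cookies and storage
        page = await context.new_page()        # one tab inside that session
        await page.goto(url)
        title = await page.title()
        await context.close()
        await browser.close()
        return title
```

With Playwright and its browsers installed, you could call it as asyncio.run(fetch_title("https://example.com")). The full script that follows uses browser.new_page() instead, a shortcut that creates a new context for the page behind the scenes.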
import asyncio
import re

from playwright.async_api import async_playwright


async def main():
    results = {}  # maps each result URL to its cleaned page text
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        # Navigate to Google
        await page.goto("https://www.google.com")
        # Search for cat memes. The selectors in this script are assumptions:
        # Google changes its markup often and may show a consent dialog first.
        await page.fill('[name="q"]', "cat memes")
        await page.press('[name="q"]', "Enter")
        # Wait for the search results to load
        await page.wait_for_selector("#search a[href]")
        # Get the search results
        search_results = await page.query_selector_all("#search a[href]")
        # Create a list of links to the search results
        # (get_attribute is a coroutine, so each call must be awaited)
        links = [await result.get_attribute("href") for result in search_results]
        # Keep only absolute http(s) URLs
        links = [link for link in links if link and link.startswith("http")]
        # Print the list of links
        print(links)
        # Take a screenshot of the search results page
        await page.screenshot(path="search-results.png")
        # Visit each page in the loop
        for link in links:
            # Navigate to the page
            await page.goto(link)
            # Wait for the page to load
            await page.wait_for_selector("body")
            # Get the rendered text from the page
            text = await page.inner_text("body")
            # Remove any leftover HTML tags using a regular expression
            cleaned_text = re.sub(r"<.*?>", "", text)
            # Strip leading and trailing whitespace
            cleaned_text = cleaned_text.strip()
            # Store the text to a dictionary
            results[link] = cleaned_text
        await browser.close()
    return results


asyncio.run(main())
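The cleaning step can also live in a small standalone function that is easy to test without a browser. The sketch below is one possible version, assuming the input may contain stray tags and HTML entities. The name clean_text and the example URL are hypothetical, and the function goes a little further than the script above by decoding entities and collapsing runs of whitespace.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace."""
    # Replace tags with a space so adjacent words do not run together
    without_tags = TAG_RE.sub(" ", raw)
    # Turn entities such as &amp; into the characters they stand for
    decoded = html.unescape(without_tags)
    # Collapse runs of spaces and newlines, then trim both ends
    return WHITESPACE_RE.sub(" ", decoded).strip()


# Hypothetical URL, used only to show the shape of the results dictionary
results = {"https://example.com/cats": clean_text("<p>Cat &amp; <b>memes</b></p>")}
print(results)  # {'https://example.com/cats': 'Cat & memes'}
```

Tags are removed before entities are decoded so that an escaped sequence such as &lt;b&gt; survives as literal text instead of being stripped as a tag.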
Conclusion
The Playwright Async API is a versatile tool for browser automation tasks, including web scraping. In this post, we used it to search Google for cat memes, take a screenshot of the results page, and build a list of result links. We then visited each link in a loop, extracted the page text, cleaned it, and stored it in a dictionary.