Webscraping Python



Introduction

We'll cover how to use headless Chrome to scrape Google Places. Google Places does not strictly require JavaScript, because Google serves a different response when JavaScript is disabled. For better user emulation when browsing or scraping Google Places, though, a real browser is recommended.

Web scraping is a technique for automatically accessing and extracting large amounts of information from websites and saving it to a local file, which can save a huge amount of time and effort. For example, suppose you are working on a phone comparison website that needs the prices and ratings of mobile phones: scraping lets you collect them from other webpages automatically.

Headless Chrome is essentially the Chrome browser running without a head (no graphical user interface). The benefit is that you can run a headless browser in a server environment that has no graphical interface attached and is normally accessed through a shell. Running headless can also be faster and use fewer system resources.

Controlling a browser

We need a way to control the browser with code. This can be done through the Chrome DevTools Protocol, or CDP. CDP is essentially a websocket server running in the browser that is based on JSON-RPC. Instead of working with CDP directly, we'll use a library called pyppeteer, a Python implementation of the CDP protocol that provides an easier-to-use abstraction. It's inspired by puppeteer, the Node library that does the same job.
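To make the JSON-RPC shape concrete, here is a minimal sketch of the kind of message a client sends over the DevTools websocket. Page.navigate is a real CDP method; the helper name cdp_message and the id counter are illustrative assumptions, and nothing is actually sent to a browser here.

```python
import itertools
import json

# Each CDP command carries a unique id so the reply can be matched to it.
_next_id = itertools.count(1)


def cdp_message(method, **params):
    """Serialize one CDP command as a JSON text frame (hypothetical helper)."""
    return json.dumps({"id": next(_next_id), "method": method, "params": params})


message = cdp_message("Page.navigate", url="https://www.google.com")
print(message)
```

A library such as pyppeteer builds and sends messages of this kind for us, which is why we rarely need to write them by hand.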

Setting up

As usual with any of my python projects, I recommend working in a virtual python environment which helps us address dependencies and versions separately for each application / project. Let's create a virtual environment in our home directory and install the dependencies we need.

Make sure you are running at least Python 3.6.1; Python 3.5 has reached end of support. The pyppeteer library will not work with Python 3.6.0, because the websockets library it depends on does not support that version.
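If you want the scraper to fail early on an unsupported interpreter, a small guard like the following could go at the top of the entry point. This is only a sketch; the names MIN_PYTHON and is_supported are made up for illustration.

```python
import sys

MIN_PYTHON = (3, 6, 1)  # first version this guide says works with pyppeteer


def is_supported(version_info=None):
    """Return True when the interpreter is at least MIN_PYTHON."""
    version = tuple(version_info or sys.version_info)[:3]
    return version >= MIN_PYTHON


if not is_supported():
    raise SystemExit("Python 3.6.1 or newer is required")
```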

Let's create the following folders and files.
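The layout can be inferred from the file names used later in this guide (google-places/__main__.py, core/browser.py and core/utils.py). The sketch below creates that skeleton from Python; the core/__init__.py file and the create_layout helper are assumptions added for illustration, not part of the original setup.

```python
from pathlib import Path


def create_layout(root="."):
    """Create the empty package skeleton used in this guide."""
    root = Path(root)
    files = [
        "google-places/__main__.py",  # entry point of the scraper
        "core/__init__.py",           # assumption: makes core importable
        "core/browser.py",            # browser helpers added later
        "core/utils.py",              # CSV helpers added later
    ]
    for relative in files:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)
    return [root / relative for relative in files]
```

Calling create_layout() once from the project directory produces the empty files; existing files are left untouched.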

We created a __main__.py file, this lets us run the Google Places scraper with the following command (nothing should happen right now):

Launching a headless browser

We need to launch a Chrome browser. By default, pyppeteer downloads a recent version of Chromium the first time it is run. It's also possible to use Chrome, as long as it is installed on your system. The library uses async/await for concurrency, so we import the asyncio package from the Python standard library.

To launch Chrome instead of Chromium, add the executablePath option to the launch function. Below, we launch the browser, navigate to Google and take a screenshot. The screenshot is saved in the folder you run the scraper from.
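A minimal sketch of that flow, assuming pyppeteer is installed. The function names build_launch_options and take_screenshot and the file name google.png are illustrative choices; launch, newPage, goto, screenshot and close are the pyppeteer calls doing the work.

```python
import asyncio


def build_launch_options(executable_path=None, headless=True):
    """Build the options dict passed to pyppeteer's launch()."""
    options = {"headless": headless}
    if executable_path:
        # Point pyppeteer at an installed Chrome instead of its Chromium.
        options["executablePath"] = executable_path
    return options


async def take_screenshot(url="https://www.google.com", path="google.png",
                          executable_path=None):
    # Imported here so this module still loads where pyppeteer is missing.
    from pyppeteer import launch

    browser = await launch(build_launch_options(executable_path))
    try:
        page = await browser.newPage()
        await page.goto(url)
        await page.screenshot({"path": path})
    finally:
        await browser.close()


def main():
    """Entry point: call this to actually launch the browser."""
    asyncio.run(take_screenshot())
```

asyncio.run needs Python 3.7 or newer; on Python 3.6 the equivalent is asyncio.get_event_loop().run_until_complete(take_screenshot()).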

Digging in

Let's create some functions in core/browser.py to simplify working with a browser and the page. We'll make use of what I believe is an awesome feature in python for simplifying management of resources called context manager. Specifically we will use an async context manager.

An asynchronous context manager is a context manager that is able to suspend execution in its enter and exit methods.
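Here is a toy example of the mechanism that uses only the standard library. The class name Resource and its events list are invented for the demonstration; the asyncio.sleep(0) calls stand in for slow operations such as starting or stopping a browser.

```python
import asyncio


class Resource:
    """Toy async context manager that records its lifecycle."""

    def __init__(self):
        self.events = []

    async def __aenter__(self):
        await asyncio.sleep(0)  # suspension point, e.g. launching a browser
        self.events.append("opened")
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await asyncio.sleep(0)  # suspension point, e.g. closing the browser
        self.events.append("closed")


async def demo():
    resource = Resource()
    async with resource:
        resource.events.append("working")
    return resource.events


print(asyncio.run(demo()))  # ['opened', 'working', 'closed']
```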

This feature in python lets us write code like the below which handles opening and closing a browser with one line.

Let's add the PageSession async context manager in the file core/browser.py.
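One possible shape for PageSession is sketched below. It is a reconstruction under assumptions, not the original implementation: the launcher argument is a hypothetical hook for swapping in a fake browser, and the networkidle0 wait is one way to let the page's JavaScript settle before we read it.

```python
class PageSession:
    """Open a browser page on enter and close the browser on exit."""

    def __init__(self, url, headless=True, launcher=None):
        self.url = url
        self.headless = headless
        self._launcher = launcher  # hypothetical hook, defaults to pyppeteer
        self.browser = None
        self.page = None

    async def __aenter__(self):
        launcher = self._launcher
        if launcher is None:
            from pyppeteer import launch as launcher  # lazy import
        self.browser = await launcher({"headless": self.headless})
        try:
            self.page = await self.browser.newPage()
            # networkidle0 waits until the page has stopped making requests.
            await self.page.goto(self.url, {"waitUntil": "networkidle0"})
        except Exception:
            # __aexit__ is not called if __aenter__ fails, so clean up here.
            await self.browser.close()
            raise
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.browser.close()
```

Inside a coroutine it is used as `async with PageSession("https://www.google.com") as session:`, after which `await session.page.content()` returns the rendered HTML.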

In our google-places/__main__.py file let's make use of our new PageSession and print the html content of the final rendered page with javascript executed.

Run the google-places module in your terminal with the same command we used earlier.

So now we can launch a browser, open a page (a tab in chrome) and navigate to a website and wait for javascript to finish loading/executing then close the browser with the above code.

Next let's do the following:

  • We want to visit google.com
  • Enter a search query for pediatrician near 94118
  • Click on google places to see more results
  • Scrape results from the page
  • Save results to a CSV file

Navigating pages

We want to end up on the following page navigations so we can pull the data we need.

Let's start by breaking up our code in google-places/__main__.py so we can first search then navigate to google places. We also want to clean up some of the string literals like the google url.

You can see the new code we added above as it has been highlighted. We use XPath to find the search bar, the search button and the view all button to get us to google places.

  1. Type in the search bar
  2. Click the search button
  3. Wait for the view all button to appear
  4. Click view all button to take us to google places
  5. Wait for an element on the new page to appear
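The five steps above can be sketched with pyppeteer's XPath helpers as follows. Every XPath expression here is an assumption made for illustration (Google's markup changes often and the real selectors are not shown in this text), as are the function names first_match and search_places.

```python
# All XPath expressions below are hypothetical placeholders.
SEARCH_BOX = '//*[@name="q"]'
SEARCH_BUTTON = '//input[@name="btnK"]'
VIEW_ALL = '//a[contains(., "More places")]'
RESULT_ITEM = '//div[@role="heading"]'


async def first_match(page, xpath):
    """Wait for an XPath to match, then return the first matching element."""
    await page.waitForXPath(xpath)
    matches = await page.xpath(xpath)
    return matches[0]


async def search_places(page, query):
    """Type the query, submit it, and follow the link to Google Places."""
    search_box = await first_match(page, SEARCH_BOX)
    await search_box.type(query)                   # 1. type in the search bar
    button = await first_match(page, SEARCH_BUTTON)
    await button.click()                           # 2. click the search button
    view_all = await first_match(page, VIEW_ALL)   # 3. wait for "view all"
    await view_all.click()                         # 4. open Google Places
    await page.waitForXPath(RESULT_ITEM)           # 5. wait for the new page
```

Here page would be the page attribute of an open browser session, and the query would be the string pediatrician near 94118.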

Scraping the data with Pyppeteer

At this point we should be on the google places page and we can pull the data we want. The navigation flow we followed before is important for emulating a user.

Let's define the data we want to pull from the page.

  • Name
  • Location
  • Phone
  • Rating
  • Website Link

In core/browser.py let's add two methods to our PageSession to help us grab the text and an attribute (the website link for the doctor).

So we added get_text and get_link. These two methods will evaluate javascript on the browser, the same way if you were to type it on the Chrome console. You can see that they just use the DOM to grab the text of the element or the href attribute.
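A sketch of what these two helpers might look like, written here as plain functions that take the page rather than as PageSession methods. pyppeteer's page.evaluate accepts a JavaScript function as a string and passes element handles through as arguments; the exact JavaScript used in the original is an assumption.

```python
async def get_text(page, element):
    """Return the text content of an element handle."""
    return await page.evaluate("(element) => element.textContent", element)


async def get_link(page, element):
    """Return the href attribute of an element handle."""
    return await page.evaluate(
        "(element) => element.getAttribute('href')", element
    )
```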

In google-places/__main__.py we will add a few functions that will grab the content that we care about from the page.

We make use of XPath to grab the elements. You can practice XPath in your Chrome browser by pressing F12 or right-clicking and choosing Inspect to open the console. Why do I use XPath? It's easier to specify complex selectors, because XPath has built-in functions for things like finding elements that contain some text or traversing the tree in various ways.

For the phone, rating and link fields we default to None and substitute with 'N/A' because not all doctors have a phone number listed, a rating or a link. All of them seem to have a location and a name.
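The substitution itself can be as small as the helper below. The name or_na and the sample record are made up for illustration; the record is not scraped data.

```python
def or_na(value, placeholder="N/A"):
    """Replace a missing (None or blank) field with a placeholder."""
    if value is None:
        return placeholder
    text = str(value).strip()
    return text if text else placeholder


# Made-up sample record with a missing phone number and link.
doctor = {"name": "Example Pediatrics", "phone": None, "rating": "4.8", "link": None}
cleaned = {field: or_na(value) for field, value in doctor.items()}
print(cleaned)
```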

Because there are many doctors listed on the page we want to find the parent element and loop over each match, then evaluate the XPath we defined above. To do this let's add two more functions to tie it all together.

The entry point here is scrape_doctors which evaluates get_doctor_details on each container element.

In the code below, we loop over each container element that matched our XPath and call get_doctor_details on it. Because we don't use the await keyword, each call returns an awaitable coroutine object rather than a result. asyncio.gather accepts these awaitables, schedules them, and evaluates everything in the tasks list.

This line allows us to wait for all async calls to finish concurrently.
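The pattern can be tried without a browser. In this sketch get_doctor_details is a stand-in that only sleeps briefly and returns a dictionary, and the containers are plain strings rather than page elements.

```python
import asyncio


async def get_doctor_details(container):
    """Stand-in for the real scraper: pretend to read one result card."""
    await asyncio.sleep(0.01)  # simulates waiting on the browser
    return {"name": container.upper()}


async def scrape_doctors(containers):
    # Calling the coroutine function without await creates coroutine objects.
    tasks = [get_doctor_details(container) for container in containers]
    # gather runs them concurrently and returns results in input order.
    return await asyncio.gather(*tasks)


results = asyncio.run(scrape_doctors(["a", "b", "c"]))
print(results)  # [{'name': 'A'}, {'name': 'B'}, {'name': 'C'}]
```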

Let's put this together in our main function. First we search and crawl to the right page, then we scrape with scrape_doctors.

Saving the output

In core/utils.py we'll add two functions to help us save our scraped output to a local CSV file.
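A sketch of what two such helpers could look like using the standard csv module. The function names normalize_rows and save_csv and the FIELDS column order are assumptions; the field names follow the data list defined earlier.

```python
import csv

FIELDS = ["name", "location", "phone", "rating", "link"]


def normalize_rows(doctors):
    """Keep only the known fields, in a fixed order, for every record."""
    return [
        {field: doctor.get(field, "N/A") for field in FIELDS}
        for doctor in doctors
    ]


def save_csv(path, doctors):
    """Write the scraped records to path with a header row."""
    rows = normalize_rows(doctors)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With helpers like these, the main function would end with a call such as save_csv("pediatricians.csv", doctors).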

Let's import it in google-places/__main__.py and save the output of scrape_doctors from our main function.

We should now have a file called pediatricians.csv which contains our output.

Wrapping up

From this guide we should have learned how to use a headless browser to crawl and scrape google places while emulating a real user. There's a lot more you can do with headless browsers such as generate pdfs, screenshots and other automation tasks.

Hopefully this guide helped you get started executing javascript and scraping with a headless browser. Till next time!

Internet extends fast and modern websites pretty often use dynamic content load mechanisms to provide the best user experience. Still, on the other hand, it becomes harder to extract data from such web pages, as it requires the execution of internal Javascript in the page context while scraping. Let's review several conventional techniques that allow data extraction from dynamic websites using Python.

What is a dynamic website?

A dynamic website is a type of website that can update or load content after the initial HTML load. So the browser receives basic HTML with JS and then loads content using received Javascript code. Such an approach allows increasing page load speed and prevents reloading the same layout each time you'd like to open a new page.

Usually, dynamic websites use AJAX to load content dynamically, or even the whole site is based on a Single-Page Application (SPA) technology.

In contrast to dynamic websites, we can observe static websites containing all the requested content on the page load.

A great example of a static website is example.com:

The whole content of this website is loaded as plain HTML during the initial page load.

To demonstrate the basic idea of a dynamic website, we can create a web page that contains dynamically rendered text. It will not include any request to get information, just a render of a different HTML after the page load:

<html>
<head>
<script>
window.addEventListener('DOMContentLoaded', function () {
  document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt'
})
</script>
</head>
<body>
<div id="test">Web Scraping is hard</div>
</body>
</html>

All we have here is an HTML file with a single <div> in the body that contains the text Web Scraping is hard. After the page loads, that text is replaced with the text generated by the Javascript:

<script>
window.addEventListener('DOMContentLoaded', function () {
  document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt'
})
</script>

To prove this, let's open this page in the browser and observe a dynamically replaced text:

Alright, so the browser displays a text, and HTML tags wrap this text.
Can't we use BeautifulSoup or LXML to parse it? Let's find out.

Extract data from a dynamic web page

BeautifulSoup is one of the most popular Python libraries for HTML parsing. A large share of Python web scraping tutorials use this library to extract the required content from the HTML.

Let's use BeautifulSoup for extracting the text inside <div> from our sample above.

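from bs4 import BeautifulSoup
import os
soup = BeautifulSoup(open(os.getcwd() + '/test.html'), 'html.parser')
print(soup.find(id='test').get_text())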

This code snippet uses the os library to locate our test HTML file (test.html) in the local directory and creates a BeautifulSoup instance stored in the soup variable. Using soup, we find the tag with id test and extract the text from it.

In the screenshot from the first part of the article, we saw that the content of the test page is I ❤️ ScrapingAnt, but the code snippet output is the following:

Web Scraping is hard

The result is different from our expectation (unless you've already figured out what is going on). Everything is correct from the BeautifulSoup perspective: it parsed the data from the provided HTML file, but we want the same result the browser renders. The reason is the dynamic Javascript, which is not executed during HTML parsing.
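The same effect can be reproduced with nothing but the standard library, which shows that it is a property of static parsing in general rather than of BeautifulSoup. In this sketch the PAGE string mirrors the sample page described above, and the TextById class is a small helper written for the demonstration.

```python
from html.parser import HTMLParser

PAGE = """<html><head><script>
window.addEventListener('DOMContentLoaded', function () {
  document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt'
})
</script></head>
<body><div id="test">Web Scraping is hard</div></body></html>"""


class TextById(HTMLParser):
    """Collect the text inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than zero while inside the target element
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if self.depth or dict(attrs).get("id") == self.target_id:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


parser = TextById("test")
parser.feed(PAGE)
print(parser.text)  # Web Scraping is hard
```

The script is read as text and never run, so the original content of the div is all a static parser can see.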

We need the HTML to be run in a browser to see the correct values and then be able to capture those values programmatically.

Python Web Scraping Tools

Below you can find four different ways to execute dynamic website's Javascript and provide valid data for an HTML parser: Selenium, Pyppeteer, Playwright, and Web Scraping API.

Selenium: web scraping with a webdriver

Selenium is one of the most popular web browser automation tools for Python. It allows communication with different web browsers by using a special connector - a webdriver.

To use Selenium with Chrome/Chromium, we'll need to download the webdriver from its repository and place it in the project folder. Don't forget to install Selenium itself, for example by executing pip install selenium.

Selenium instantiating and scraping flow is the following:

  • define and setup Chrome path variable
  • define and setup Chrome webdriver path variable
  • define browser launch arguments (to use headless mode, proxy, etc.)
  • instantiate a webdriver with defined above options
  • load a webpage via instantiated webdriver

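In code, it looks like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import os

opts = Options()
# opts.add_argument('--headless')  # Uncomment if the headless version is needed
opts.binary_location = '<path to Chrome executable>'
# Set the location of the webdriver
chrome_driver = os.path.join(os.getcwd(), '<Chrome webdriver filename>')
# Instantiate a webdriver (Selenium 4 style; older releases took executable_path)
driver = webdriver.Chrome(service=Service(chrome_driver), options=opts)
# Load the HTML page
driver.get('file://' + os.getcwd() + '/test.html')
# Parse the rendered page with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id='test').get_text())
driver.quit()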

And finally, we'll receive the required result:

I ❤️ ScrapingAnt

Using Selenium to scrape dynamic websites with Python is not complicated, and it lets you choose a specific browser and version. However, it consists of several moving parts that must be maintained, and the code itself contains boilerplate such as the setup of the browser and webdriver.

I like to use Selenium for my web scraping project, but you can find easier ways to extract data from dynamic web pages below.

Pyppeteer: Python headless Chrome

Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript library for automating (headless) Chrome/Chromium. It can do mostly the same things as Puppeteer, but using Python instead of NodeJS.

Puppeteer is a high-level API to control headless Chrome, so it allows you to automate actions you're doing manually with the browser: copy page's text, download images, save page as HTML, PDF, etc.

To install Pyppeteer, you can execute pip install pyppeteer.

The usage of Pyppeteer for our needs is much simpler than Selenium:

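import asyncio
import os
from bs4 import BeautifulSoup
from pyppeteer import launch

async def main():
    # Launch the browser and open a new page
    browser = await launch()
    page = await browser.newPage()
    # Create a URI for our test file and open it in the page
    page_path = 'file://' + os.getcwd() + '/test.html'
    await page.goto(page_path)
    page_content = await page.content()
    soup = BeautifulSoup(page_content, 'html.parser')
    print(soup.find(id='test').get_text())
    await browser.close()

asyncio.run(main())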

I've tried to comment on every atomic part of the code for a better understanding. However, generally, we've just opened a browser page, loaded a local HTML file into it, and extracted the final rendered HTML for further BeautifulSoup processing.

As we can expect, the result is the following:

I ❤️ ScrapingAnt

We did it again, and this time we did not have to worry about finding, downloading, and connecting a webdriver to a browser. However, Pyppeteer looks abandoned and not properly maintained. This situation may change in the near future, but I'd suggest looking at a more powerful library.

Playwright: Chromium, Firefox and Webkit browser automation

Playwright can be considered an extended Puppeteer, as it supports more browser types (Chromium, Firefox, and Webkit) for automating modern web app testing and scraping. You can use the Playwright API in JavaScript & TypeScript, Python, C#, and Java. And it's excellent that the original Playwright maintainers support Python.

The API is almost the same as Pyppeteer's, but it comes in both sync and async versions.

Installation is as simple as always:

pip install playwright
playwright install

Let's rewrite the previous example using Playwright.

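from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright
import os

with sync_playwright() as p:
    # Launch the browser
    browser = p.chromium.launch()
    # Open a new browser page
    page = browser.new_page()
    # Create a URI for our test file
    page_path = 'file://' + os.getcwd() + '/test.html'
    # Open our test file in the opened page
    page.goto(page_path)
    page_content = page.content()
    # Process extracted content with BeautifulSoup
    soup = BeautifulSoup(page_content, 'html.parser')
    print(soup.find(id='test').get_text())
    # Close browser
    browser.close()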

As a good tradition, we can observe our beloved output:

I ❤️ ScrapingAnt

We've gone through several different data extraction methods with Python, but is there any more straightforward way to implement this job? How can we scale our solution and scrape data with several threads?

Meet the web scraping API!

Web Scraping API

ScrapingAnt web scraping API provides the ability to scrape dynamic websites with a single API call. It already handles headless Chrome and rotating proxies, so the response will already contain Javascript-rendered content. ScrapingAnt's proxy pool prevents blocking and provides a consistently high data extraction success rate.

Usage of web scraping API is the simplest option and requires only basic programming skills.

You do not need to maintain the browser, library, proxies, webdrivers, or any other part of the web scraper, and can focus on the most exciting part of the work: data analysis.

As the web scraping API runs on the cloud servers, we have to serve our file somewhere to test it. I've created a repository with a single file: https://github.com/kami4ka/dynamic-website-example/blob/main/index.html

To check it out as HTML, we can use another great tool: HTMLPreview

The final test URL for scraping the dynamic web data looks like this: http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html

The scraping code itself is the simplest one across all four described libraries. We'll use ScrapingAntClient library to access the web scraping API.

Let's install it first:

And use the installed library:

from bs4 import BeautifulSoup
from scrapingant_client import ScrapingAntClient
# Define URL with a dynamic web content
url = 'http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html'
# Create a ScrapingAntClient instance
client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')
# Get the HTML page rendered content
page_content = client.general_request(url).content
# Parse content with BeautifulSoup
soup = BeautifulSoup(page_content, 'html.parser')
print(soup.find(id='test').get_text())

To get your API token, please visit the Login page to authorize in the ScrapingAnt User panel. It's free.

And the result is still the required one.

All the headless browser magic happens in the cloud, so you only need to make an API call to get the result.

Check out the documentation for more info about ScrapingAnt API.

Summary

Today we've checked four free tools that allow scraping dynamic websites with Python. All these libraries use a headless browser (or API with a headless browser) under the hood to correctly render the internal Javascript inside an HTML page. Below you can find links to find out more information about those tools and choose the handiest one:

Happy web scraping, and don't forget to use proxies to avoid blocking 🚀