
Python Web Scraping with Requests

In this tutorial, we will go over how to use the Python library Requests to retrieve content from web pages for scraping.

Table of Contents

  • Installation
  • Basic Usage
  • Sessions
  • User-Agent
  • Status Code
  • Parsing
  • JavaScript
  • Useful Resources

Installation

To start, install requests.

pip install requests

Basic Usage

Make sure to import requests.

import requests

GET

Most webpages can be retrieved for scraping using the GET method, which sends an HTTP GET request to the server.

response = requests.get(url, params=params)

To determine the parameters needed, you can visit the website in your web browser and examine the end of the URL after the question mark.

For example, the parameters needed for Google Scholar are the language hl and the query q.

Google Scholar URL

These parameters can then be included in a dictionary.

params = {
  'hl': 'en',     # Set the language to English
  'q': 'python',  # Keyword to search
}
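To see how this dictionary is turned into the query string without sending anything over the network, you can prepare the request and inspect its URL. This is a sketch; the Google Scholar URL below is assumed from the example above.

```python
import requests

params = {
    'hl': 'en',     # Set the language to English
    'q': 'python',  # Keyword to search
}

# Build the request without sending it, to see how the params
# dictionary is encoded into the final URL
request = requests.Request('GET', 'https://scholar.google.com/scholar', params=params)
prepared = request.prepare()

print(prepared.url)  # the params appear after the question mark
```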

You can then read the content of the response, which in most cases is the HTML of the webpage.

content = response.text

If you are retrieving JSON content from an API, you can use the corresponding response.json() method to parse it into a dictionary.

json_data = response.json()

for key in json_data:
  print(f"{key}: {json_data[key]}")

POST

Some endpoints may use an HTML form or a JSON object to set the parameters needed to display the data you want. In this case, you can use the POST method.

data = {
  'key': 'value'
}

# Send data as HTML form
response = requests.post(url, data=data)

# Send data as JSON
response = requests.post(url, json=data)
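To see the difference between the two variants without sending anything, you can prepare both requests and compare how the body is encoded. A sketch; the example.com URL is a placeholder.

```python
import requests

data = {'key': 'value'}

# Prepare (without sending) both variants to compare their encoding
form_request = requests.Request('POST', 'https://example.com/api', data=data).prepare()
json_request = requests.Request('POST', 'https://example.com/api', json=data).prepare()

print(form_request.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_request.body)                     # key=value
print(json_request.headers['Content-Type'])  # application/json
```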

You can then use the response in the same way as with GET.

To determine the data that you need to send, you can use the Network tab in your browser's developer tools to examine the specific POST request that the website sends.

Sessions

Alternatively, you can use sessions when making multiple requests to the same website, or when you want to persist cookies or other information such as headers between requests. The syntax is largely the same, except that you create a session and use it to make the requests.

session = requests.Session()
response1 = session.post(url)
response2 = session.get(url)
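As a sketch of what the session persists, you can set a header and a cookie on the session and inspect an outgoing request before it is sent. The header name, cookie, and URL below are hypothetical.

```python
import requests

session = requests.Session()
session.headers.update({'X-Api-Token': 'abc123'})  # hypothetical header, kept for all requests
session.cookies.set('session_id', '42')            # hypothetical cookie, kept for all requests

# prepare_request merges the session-level headers and cookies
# into the outgoing request without sending it
prepared = session.prepare_request(requests.Request('GET', 'https://example.com'))

print(prepared.headers['X-Api-Token'])  # abc123
print(prepared.headers['Cookie'])       # session_id=42
```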

User-Agent

Some websites might employ protection against web scrapers. In those cases, you can add a User-Agent header so that the request looks like it came from a browser.

headers = {
    'User-Agent': 'My User Agent 1.0',
}

response = requests.get(url, headers=headers)

# When using Sessions
session = requests.Session()
session.headers.update({'User-Agent': 'My User Agent 1.0'})

Some examples of user agents can be found here.

Status Code

After executing the request, you can check the status_code of the response.

if response.status_code == 200:
  pass  # success: handle the response here

Responses are classified into five groups:

  • 100 level - Informational: the request was received and processing is continuing
  • 200 level - Successful: the request was successful
  • 300 level - Redirection: the request requires further action
  • 400 level - Client Error: the request contains errors or could not be fulfilled
  • 500 level - Server Error: the server could not fulfill the request

In most cases, you only need to check whether the status code is 200.
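The five groups above can be illustrated with a small helper (not part of Requests) that maps a status code to its group using integer division.

```python
# A hypothetical helper: the hundreds digit of the status code
# identifies which of the five groups it belongs to
def status_class(code):
    groups = {
        1: 'Informational',
        2: 'Successful',
        3: 'Redirection',
        4: 'Client Error',
        5: 'Server Error',
    }
    return groups[code // 100]

print(status_class(200))  # Successful
print(status_class(404))  # Client Error
```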

Alternatively, you can use Requests' built-in raise_for_status(), which raises an HTTPError if the response has an unsuccessful status code.

try:
  response.raise_for_status()
except requests.exceptions.HTTPError as e:
  pass  # handle the error here
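As a sketch of how this behaves, you can build a bare Response object by hand so no network access is needed; in real code, response would come back from requests.get or requests.post.

```python
import requests

# A Response constructed by hand purely for illustration
response = requests.Response()
response.status_code = 404
response.reason = 'Not Found'
response.url = 'https://example.com/missing'  # hypothetical URL

try:
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(f'Request failed: {e}')  # includes the status code, reason, and URL
```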

A full list of HTTP response codes can be found here.

Parsing

With the webpage retrieved, you can now parse the HTML using a library such as BeautifulSoup4. This loads the page into a tree-like structure and allows you to search for and extract specific content from the response (e.g., the text of specific HTML elements). For a more detailed tutorial, you can follow this link.

JavaScript

It is important to note that Requests does not execute JavaScript, so the response it returns might differ from what you see in your browser. To work around this, you can consider alternative libraries such as Requests-HTML, or disable JavaScript in your web browser (using an extension such as Disable JavaScript or your browser's developer tools) to preview the page as Requests will receive it.

Useful Resources