In this tutorial, we will go over how to use the Python library Requests to retrieve content from web pages for scraping.
To start, install requests.
pip install requests
Make sure to import requests.
import requests
Most webpages can be scraped or retrieved using the GET method. This is analogous to sending an HTTP GET request.
response = requests.get(url, params=params)
To determine the parameters needed, you can visit the website using your web browser and then examine the end of the URL after the question mark.
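As a quick illustration, you can prepare a request without sending it to see how a params dictionary is encoded into the final URL (the URL and parameters below are placeholders):

```python
import requests

# Build and prepare (but do not send) a GET request to inspect the final URL.
req = requests.Request('GET', 'https://example.com/search',
                       params={'hl': 'en', 'q': 'python'})
prepared = req.prepare()
print(prepared.url)  # https://example.com/search?hl=en&q=python
```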
For example, the parameters needed for Google Scholar are the language hl and the query q.
These parameters can then be included in a dictionary.
params = {
    'hl': 'en',  # Set the language to English
    'q': 'python',  # Keyword to search
}
You can then retrieve the content of the webpage, which in most cases is the HTML of the webpage.
content = response.text
If you are retrieving JSON content from an API, you can use the corresponding response.json() method to output a dictionary.
json_data = response.json()
for key in json_data:
    print(f"{key}: {json_data[key]}")
Some endpoints may use an HTML form or a JSON object to set the necessary parameters to display the data that you need. In this case, you can use the POST method.
data = {
    'key': 'value'
}
# Send data as HTML form
response = requests.post(url, data=data)
# Send data as JSON
response = requests.post(url, json=data)
You can then utilize response in the same way as with GET.
To determine the data that you need to send, you can use the Network tab in your browser's developer tools to examine the specific POST request that the website sends.
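To see the difference between the two options without sending anything, you can prepare both kinds of POST request and inspect their bodies and Content-Type headers (the URL is a placeholder):

```python
import requests

data = {'key': 'value'}

# data= produces a form-encoded body
form_req = requests.Request('POST', 'https://example.com/endpoint', data=data).prepare()
print(form_req.body)                      # key=value
print(form_req.headers['Content-Type'])   # application/x-www-form-urlencoded

# json= produces a JSON-serialized body: {"key": "value"}
json_req = requests.Request('POST', 'https://example.com/endpoint', json=data).prepare()
print(json_req.body)
print(json_req.headers['Content-Type'])   # application/json
```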
Alternatively, you can use sessions when making multiple requests to the same website, or if you want to persist cookies or other information such as headers between requests. The syntax is largely the same, except that you create a session and use it to make requests.
session = requests.Session()
response1 = session.post(url)
response2 = session.get(url)
Some websites might employ protection against web scrapers. In those cases, you can add a User-Agent header to make the request look like it is coming from a browser.
headers = {
    'User-Agent': 'My User Agent 1.0',
}
response = requests.get(url, headers=headers)
# When using Sessions
session = requests.Session()
session.headers.update({'User-Agent': 'My User Agent 1.0'})
Some examples of user agents can be found here.
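You can verify that headers set on a session are merged into every request it prepares, again without sending anything (the URL is a placeholder):

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'My User Agent 1.0'})

# Session-level headers are merged into each request the session prepares
prepared = session.prepare_request(requests.Request('GET', 'https://example.com'))
print(prepared.headers['User-Agent'])  # My User Agent 1.0
```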
After executing the request, you can check the status_code of the response.
if response.status_code == 200:
    # success
Responses are classified into five groups:
- 100-level (Informational): the request was received and the process is being continued
- 200-level (Successful): the request was successful
- 300-level (Redirection): the request requires further action
- 400-level (Client Error): the request contains errors or could not be fulfilled
- 500-level (Server Error): the server could not fulfill the request
In most cases, you only have to worry about whether the status code is 200 or not.
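Requests also provides the response.ok shorthand, which is True for any status code below 400. A small sketch using hand-built Response objects for illustration (normally you would get these back from requests.get):

```python
import requests

# Hand-built responses for illustration only; real code gets these from requests.get()
good = requests.Response()
good.status_code = 200
bad = requests.Response()
bad.status_code = 404

print(good.ok)  # True
print(bad.ok)   # False
```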
Alternatively, you can use Requests' built-in raise_for_status(), which raises an HTTPError if an unsuccessful status code was returned.
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    # handle error here
A full list of HTTP response codes can be found here.
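For example, with a hand-built 404 response (for illustration only; real code gets one from requests.get), raise_for_status() raises an HTTPError that you can catch:

```python
import requests

# Hand-built response for illustration; real code gets this from requests.get()
response = requests.Response()
response.status_code = 404
response.reason = 'Not Found'

try:
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(e)  # e.g. "404 Client Error: Not Found for url: None"
```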
With the webpage retrieved, you can now parse the HTML data using a library such as BeautifulSoup4. This loads the page into a tree-like structure and allows you to search for and extract specific content from the response (e.g., the text of specific HTML elements). For a more detailed tutorial, you can follow this link.
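As a minimal sketch of what that looks like (assuming BeautifulSoup4 is installed and importable as bs4):

```python
from bs4 import BeautifulSoup

# A literal HTML string stands in for the page content here
html = "<html><body><h1>Example Title</h1><p class='intro'>Hello</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Navigate the parsed tree and extract text from specific elements
print(soup.h1.text)                          # Example Title
print(soup.find('p', class_='intro').text)   # Hello
```

In practice you would pass response.text to BeautifulSoup instead of a literal string.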
It is important to note that Requests does not load JavaScript, so the response given by Requests might differ from what you might expect. To work around this, you can consider alternative libraries such as Requests-HTML, or disable JavaScript in your web browser (using an extension such as Disable JavaScript, or the developer tools in your specific browser) to preview the webpage that Requests will deliver.
