GAMR1520: Markup languages and scripting


Lab 3.1: Accessing data via HTTP

Part of Week 3: Practical python, some useful libraries

General setup

For all lab exercises you should create a folder for the lab somewhere sensible.

Assuming you have a GAMR1520-labs folder, you should create a GAMR1520-labs/week_3 folder for this week and a GAMR1520-labs/week_3/lab_3.1 folder inside that.

Create a collection of scripts. If necessary, use folders to group related examples.

GAMR1520-labs
└─ week_3
    └─ lab_3.1
        ├─ experiment1.py
        └─ experiment2.py

Try to name your files better than this; the filenames should reflect their content. For example, string_methods.py, conditionals.py or while_loops.py.

Make sure your filenames give clues about what is inside each file. A future version of you will be looking back at these, trying to remember where a particular example was.

General approach

As you encounter new concepts, try to create examples for yourself that prove you understand what is going on. Try to break stuff; it's a good way to learn. But always save a working version.

Modifying the example code is a good start, but try to write your own programs from scratch, based on what the examples show. They might start very simple, but over the weeks you can develop them into more complex programs.

Think of a program you would like to write (don't be too ambitious). Break the problem down into small pieces and spend some time each session trying to solve one small piece.

In this set of exercises we will load data from the web. Beginning with a simple HTML example we will see how easy it is to load data into python.

We will then move on to using a JSON API. The dataset we will be using comes from the Star Wars API (swapi), which provides access to JSON-formatted data relating to the Star Wars films.

We will develop some moderately complex code to access the data via HTTP requests, interpret the responses as JSON-formatted data and store them in files. Finally, we will write functions to extract specific details from the data and generate some outputs.

Making HTTP requests with urllib

To make HTTP requests we can import functions from the urllib package, which is part of the python standard library.

The standard library includes vast amounts of useful python code maintained by the core python developers. We will only scratch the surface of the standard library. We've already looked at json and pathlib; now we are looking briefly at the urllib.request module.

Making simple HTTP requests is pretty easy. We can just import the urlopen function from the urllib.request module and call it with a url as an argument.

from urllib.request import urlopen

url = 'http://gamr1520.github.io/GAMR1520/exercises/2.1.html'
response = urlopen(url)

In the above code, we are requesting the HTML file containing the previous exercise. The request generates an http.client.HTTPResponse object.

Our response variable is assigned an instance of the HTTPResponse class, which can be found in the client module of the standard library http package.

To access the content of the response, we need to call the HTTPResponse.read() method.

from urllib.request import urlopen

url = 'http://gamr1520.github.io/GAMR1520/exercises/2.1.html'
response = urlopen(url)
data = response.read().decode('utf-8')

print(data[:95])

Calling read() on the response object extracts the body of the response as a sequence of bytes. Notice we then decode the bytes into a string using the UTF-8 encoding, all on one line.

As you may notice, urllib requires a bit of work to use. The third-party urllib3 and requests libraries are popular alternatives which provide more power and a cleaner API (a brief comparison using requests follows the output below).

Finally, we print a small chunk of the file. The output looks like this (at the time of writing).

<!DOCTYPE html>
<html lang="en">
<head>
<title>Files and folders</title>
<meta charset="utf-8">

So, with a small amount of work, we can access data from anywhere on the public web.
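For comparison, the same request made with the third-party requests library (not part of the standard library, so it would need to be installed separately, e.g. with pip) might look something like this sketch:

import requests

url = 'http://gamr1520.github.io/GAMR1520/exercises/2.1.html'
response = requests.get(url)
data = response.text  # requests decodes the body for us

print(data[:95])

We will stick with urllib in these exercises since it requires no extra installation.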

Make an HTTP request

Take the example code and modify it.

  1. Write a script to download a website of your choice.
  2. Write a simple function that takes a url as an argument and returns the data as a utf-8 encoded string.

If you want to parse an HTML file (e.g. for scraping data from a website), a popular third-party library is beautifulsoup.
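For example, a minimal sketch using beautifulsoup (installed separately as the bs4 package) to pull the page title out of the HTML we downloaded above might look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://gamr1520.github.io/GAMR1520/exercises/2.1.html'
html = urlopen(url).read().decode('utf-8')

# parse the HTML and print the content of the <title> element
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)

We won't need HTML parsing in the rest of this exercise because the swapi data is already structured as JSON.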

JSON APIs

Web services that provide JSON-formatted data over HTTP are increasingly popular. The Star Wars API is a toy example of this. It provides JSON data containing information about the Star Wars films.

It’s pretty out of date, but we will use it anyway.

We can get data by making HTTP requests to https://swapi.py4e.com/api/ and other endpoints which are specified in the documentation.

from urllib.request import urlopen

url = 'https://swapi.py4e.com/api/'
response = urlopen(url)
data = response.read().decode('utf-8')

print(data)

The output is a json-formatted string with the keys “people”, “planets”, “films”, “species”, “vehicles” and “starships”. Each key has a url to the appropriate document as the value.

{"people":"https://swapi.py4e.com/api/people/","planets":"https://swapi.py4e.com/api/planets/","films":"https://swapi.py4e.com/api/films/","species":"https://swapi.py4e.com/api/species/","vehicles":"https://swapi.py4e.com/api/vehicles/","starships":"https://swapi.py4e.com/api/starships/"}

To extract the data into a python dictionary, we need to pass the resultant string into the json.loads() function. We can update our script as follows:

from urllib.request import urlopen
import json

url = 'https://swapi.py4e.com/api/'
response = urlopen(url)
data = json.loads(response.read().decode('utf-8'))

We can loop over the dictionary items and print each in turn.

for key, value in data.items(): 
    print(f"{key}: {value}")
people: https://swapi.py4e.com/api/people/
planets: https://swapi.py4e.com/api/planets/
films: https://swapi.py4e.com/api/films/
species: https://swapi.py4e.com/api/species/
vehicles: https://swapi.py4e.com/api/vehicles/
starships: https://swapi.py4e.com/api/starships/

We can see that the API has given us more urls to investigate. So we can add something like this to the end of our script:

response = urlopen(data['people'])
people = json.loads(response.read().decode('utf-8'))

Now in two lines of code we have grabbed more data from the web.

Create a function for loading JSON data

Write a load_json() function that takes a url as an argument and returns the parsed Python object. The function can assume that the response body contains a JSON-formatted string.

You should be able to replicate the above chain of requests like this:

url = 'https://swapi.py4e.com/api/'
swapi = load_json(url)
people = load_json(swapi['people'])

Paged data

The API provides the data in pages. Study the people data. You should see that the dictionary keys include 'count', 'next', 'previous' and 'results'. We have only grabbed the first page of data with this request. The 'results' key contains the data for this page. The url in the 'next' key will grab the next page.
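You could check this with a few quick print statements (a sketch, assuming the people variable from the previous exercise):

print(people.keys())           # should include 'count', 'next', 'previous' and 'results'
print(people['count'])         # the total number of records across all pages
print(people['next'])          # the url of the next page (empty on the last page)
print(len(people['results']))  # the number of records in this page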

Different APIs will have different formats. When writing an application that interacts with an API, you usually need to encapsulate the specifics of its data structures in your code.

We can use a list comprehension to extract a list of names from the first page.

[p['name'] for p in people['results']]

The result is a simple list.

['Luke Skywalker', 'C-3PO', 'R2-D2', 'Darth Vader', 'Leia Organa', 'Owen Lars', 'Beru Whitesun lars', 'R5-D4', 'Biggs Darklighter', 'Obi-Wan Kenobi']

However, this is only one page. So we need to loop over all the pages and extract the necessary data.

You should already have a function which looks something like this.

def load_json(url):
    """Load json data from a given url"""
    print(f"Requesting: {url}")
    response = urlopen(url)
    json_data = response.read().decode('utf-8')
    return json.loads(json_data)

If yours is better, let me know. I’ve added a print statement to indicate when we are requesting data from the web.

Create a unique list of planets

Write a script to gather the people data from all pages.

  1. Start the list by requesting the first page as we already have.
  2. You will need to start an infinite loop.
  3. Within the loop, request the json data from the url in the 'next' key.
  4. Add the new data onto the result.
  5. You can break the loop when the 'next' key is empty.
  • Print a list of strings containing all the people names.
  • Print a list of strings containing all the planet names.
  • Convert your code into a reusable function load_collection.

A partial solution follows. Make sure you have a go yourself before looking ahead.

To implement this, we can add another function to loop over the collection, one page at a time.

def load_collection(url):
    result = []
    while True:
        chunk = load_json(url)
        result += chunk['results']
        if not chunk['next']: 
            break
        url = chunk['next']
    return result

See how the above function prepares a result list to contain the data it gathers. It defines an infinite loop in which it loads a chunk of data and adds chunk['results'] to the growing result list. If the 'next' value is empty (it is None on the last page) then it calls break to exit the loop and returns the result. Otherwise it updates the url to the 'next' value and repeats the loop to get the next page of data.

With the above two functions load_json and load_collection we can now do a lot with just a few lines of code.

url = 'https://swapi.py4e.com/api/'
swapi = load_json(url)
people = load_collection(swapi['people'])
print([p['name'] for p in people])
Requesting: https://swapi.py4e.com/api/
Requesting: https://swapi.py4e.com/api/people/
Requesting: https://swapi.py4e.com/api/people/?page=2
Requesting: https://swapi.py4e.com/api/people/?page=3
Requesting: https://swapi.py4e.com/api/people/?page=4
Requesting: https://swapi.py4e.com/api/people/?page=5
Requesting: https://swapi.py4e.com/api/people/?page=6
Requesting: https://swapi.py4e.com/api/people/?page=7
Requesting: https://swapi.py4e.com/api/people/?page=8
Requesting: https://swapi.py4e.com/api/people/?page=9
['Luke Skywalker', 'C-3PO', 'R2-D2', 'Darth Vader', 'Leia Organa', 'Owen Lars', 'Beru Whitesun lars', 'R5-D4', 'Biggs Darklighter', 'Obi-Wan Kenobi', 'Anakin Skywalker', 'Wilhuff Tarkin', 'Chewbacca', 'Han Solo', 'Greedo', 'Jabba Desilijic Tiure', 'Wedge Antilles', 'Jek Tono Porkins', 'Yoda', 'Palpatine', 'Boba Fett', 'IG-88', 'Bossk', 'Lando Calrissian', 'Lobot', 'Ackbar', 'Mon Mothma', 'Arvel Crynyd', 'Wicket Systri Warrick', 'Nien Nunb', 'Qui-Gon Jinn', 'Nute Gunray', 'Finis Valorum', 'Padmé Amidala', 'Jar Jar Binks', 'Roos Tarpals', 'Rugor Nass', 'Ric Olié', 'Watto', 'Sebulba', 'Quarsh Panaka', 'Shmi Skywalker', 'Darth Maul', 'Bib Fortuna', 'Ayla Secura', 'Ratts Tyerel', 'Dud Bolt', 'Gasgano', 'Ben Quadinaros', 'Mace Windu', 'Ki-Adi-Mundi', 'Kit Fisto', 'Eeth Koth', 'Adi Gallia', 'Saesee Tiin', 'Yarael Poof', 'Plo Koon', 'Mas Amedda', 'Gregar Typho', 'Cordé', 'Cliegg Lars', 'Poggle the Lesser', 'Luminara Unduli', 'Barriss Offee', 'Dormé', 'Dooku', 'Bail Prestor Organa', 'Jango Fett', 'Zam Wesell', 'Dexter Jettster', 'Lama Su', 'Taun We', 'Jocasta Nu', 'R4-P17', 'Wat Tambor', 'San Hill', 'Shaak Ti', 'Grievous', 'Tarfful', 'Raymus Antilles', 'Sly Moore', 'Tion Medon', 'Finn', 'Rey', 'Poe Dameron', 'BB8', 'Captain Phasma']

However, every time we call the function we have to wait for the API calls to complete. We could refactor the code to make the requests asynchronously and in parallel, but this is outside the scope of this module. Instead, we will cache the results in files so that subsequent calls can read the local data rather than making a request to the API.

So, we want to define a new function which will check to see if we have a local copy of the data from a given url.

For this we need to define a mapping between the url and the filesystem path. For example, the file we get from this url:

https://swapi.py4e.com/api/people/?page=4

Might be saved somewhere like this:

swapi/people_page=4.json

So, we can define a simple function to convert a url to a pathlib.Path object.

from pathlib import Path

def path_from_url(url):
    root = Path('swapi')
    # drop the 'https://swapi.py4e.com/api/' prefix (27 characters) and tidy the rest
    filename = f"{url[27:].strip('/').replace('/?', '_')}.json"
    return root / filename

Our function strips off the first 27 characters of the url (the 'https://swapi.py4e.com/api/' prefix) as well as any trailing forward slash, and it replaces '/?' with a simple '_' to sanitise the filename a bit. It also places everything inside a 'swapi' folder to hold all the downloaded files.
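As a quick sanity check, passing the paged url from above through the function should give a path like this (the separator will be a backslash on Windows):

print(path_from_url('https://swapi.py4e.com/api/people/?page=4'))
swapi/people_page=4.json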

Now we can define a new load_json_with_cache function that will only make the HTTP request if we don’t already have a copy. If it does download some data, it will save it to a file before returning it. If it finds a file, then it reads the local copy of the data and doesn’t need to download anything.

def load_json_with_cache(url):
    path = path_from_url(url)
    if not path.exists():
        data = load_json(url)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open('w') as f:
            json.dump(data, f, indent=2)
        return data
    print(f"Opening file: {path}")
    with path.open('r') as f:
        return json.load(f)

With this, we can now update our load_collection function to use load_json_with_cache rather than the simpler load_json function.

def load_collection(url):
    result = []
    while True:
        chunk = load_json_with_cache(url)
        result += chunk['results']
        if not chunk['next']: 
            break
        url = chunk['next']
    return result

Try loading the people data again and you should see a collection of files appear.

url = 'https://swapi.py4e.com/api/'
swapi = load_json(url)
people = load_collection(swapi['people'])
print([p['name'] for p in people])
Requesting: https://swapi.py4e.com/api/
Requesting: https://swapi.py4e.com/api/people/
Requesting: https://swapi.py4e.com/api/people/?page=2
Requesting: https://swapi.py4e.com/api/people/?page=3
Requesting: https://swapi.py4e.com/api/people/?page=4
Requesting: https://swapi.py4e.com/api/people/?page=5
Requesting: https://swapi.py4e.com/api/people/?page=6
Requesting: https://swapi.py4e.com/api/people/?page=7
Requesting: https://swapi.py4e.com/api/people/?page=8
Requesting: https://swapi.py4e.com/api/people/?page=9
['Luke Skywalker', 'C-3PO', 'R2-D2', 'Darth Vader', 'Leia Organa', 'Owen Lars', 'Beru Whitesun lars', 'R5-D4', 'Biggs Darklighter', 'Obi-Wan Kenobi', 'Anakin Skywalker', 'Wilhuff Tarkin', 'Chewbacca', 'Han Solo', 'Greedo', 'Jabba Desilijic Tiure', 'Wedge Antilles', 'Jek Tono Porkins', 'Yoda', 'Palpatine', 'Boba Fett', 'IG-88', 'Bossk', 'Lando Calrissian', 'Lobot', 'Ackbar', 'Mon Mothma', 'Arvel Crynyd', 'Wicket Systri Warrick', 'Nien Nunb', 'Qui-Gon Jinn', 'Nute Gunray', 'Finis Valorum', 'Padmé Amidala', 'Jar Jar Binks', 'Roos Tarpals', 'Rugor Nass', 'Ric Olié', 'Watto', 'Sebulba', 'Quarsh Panaka', 'Shmi Skywalker', 'Darth Maul', 'Bib Fortuna', 'Ayla Secura', 'Ratts Tyerel', 'Dud Bolt', 'Gasgano', 'Ben Quadinaros', 'Mace Windu', 'Ki-Adi-Mundi', 'Kit Fisto', 'Eeth Koth', 'Adi Gallia', 'Saesee Tiin', 'Yarael Poof', 'Plo Koon', 'Mas Amedda', 'Gregar Typho', 'Cordé', 'Cliegg Lars', 'Poggle the Lesser', 'Luminara Unduli', 'Barriss Offee', 'Dormé', 'Dooku', 'Bail Prestor Organa', 'Jango Fett', 'Zam Wesell', 'Dexter Jettster', 'Lama Su', 'Taun We', 'Jocasta Nu', 'R4-P17', 'Wat Tambor', 'San Hill', 'Shaak Ti', 'Grievous', 'Tarfful', 'Raymus Antilles', 'Sly Moore', 'Tion Medon', 'Finn', 'Rey', 'Poe Dameron', 'BB8', 'Captain Phasma']

Now that the files have been saved, running the function again won't be delayed by waiting for the API to respond, because it will use the local files.
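On a second run the output should look something like this, with the 'Requesting' messages replaced by 'Opening file' messages (the root url is still requested because the script above loads it with load_json rather than the cached version):

Requesting: https://swapi.py4e.com/api/
Opening file: swapi/people.json
Opening file: swapi/people_page=2.json
Opening file: swapi/people_page=3.json

and so on up to page 9, followed by the same list of names as before.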

This is a general-purpose (ish) solution and can be used for all of the swapi API.

Calculate some statistics

Inspect the swapi data. Practice printing selected parts. Try extracting the first value of a given type and printing the dictionary keys.

Try this…

url = 'https://swapi.py4e.com/api/'
swapi = load_json(url)

# load all the species data
species = load_collection(swapi['species'])

# print the keys from the first item
print(species[0].keys())

# create a dictionary of species by classification
species_by_class = {}
for s in species:
    classification = s['classification'] 
    try:
        species_by_class[classification].append(s['name'])
    except KeyError:
        species_by_class[classification] = [s['name']]
print(species_by_class)
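As an aside, the try/except pattern above is just one way to build a dictionary of lists. The dict.setdefault method offers a more compact alternative that does the same job:

# equivalent to the try/except version above
species_by_class = {}
for s in species:
    # setdefault returns the existing list for this classification,
    # or inserts (and returns) a new empty list if the key is missing
    species_by_class.setdefault(s['classification'], []).append(s['name'])
print(species_by_class)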

Try to calculate a few statistics.

For example:

  • How many people from each species are listed?
  • How many species from each classification are listed?
  • How many people, planets, starships and vehicles are listed for each film?
  • Which film has the longest “opening_crawl”?
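If you get stuck, here is a minimal sketch for the last question, assuming each film record includes 'title' and 'opening_crawl' keys:

# load all the film data (cached after the first run)
films = load_collection(swapi['films'])

# pick the film whose opening crawl contains the most characters
longest = max(films, key=lambda film: len(film['opening_crawl']))
print(longest['title'], len(longest['opening_crawl']))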