GAMR1520: Markup languages and scripting

python logo

Lab 2.1: Files and folders

Part of Week 2: Files, functions and classes

General setup

For all lab exercises you should create a folder for the lab somewhere sensible.

Assuming you have a GAMR1520-labs folder, you should create a GAMR1520-labs/week_2 folder for this week and a GAMR1520-labs/week_2/lab_2.1 folder inside that.

Create a collection of scripts. If necessary, use folders to group related examples.

GAMR1520-labs
└─ week_2
    └─ lab_2.1
        ├─ experiment1.py
        └─ experiment2.py

Try to name your files better than this, the filename should reflect their content. For example, string_methods_.py, conditionals.py or while_loops.py.

Make sure your filenames give clues about what is inside each file. A future version of you will be looking back at these, trying to remember where a particular example was.

General approach

As you encounter new concepts, try to create examples for yourself that prove you understand what is going on. Try to break stuff, its a good way to learn. But always save a working version.

Modifying the example code is a good start, but try to write your own programmes from scratch, based on the example code. They might start very simple, but over the weeks you can develop them into more complex programmes.

Think of a programme you would like to write (don't be too ambitious). Break the problem down into small pieces and spend some time each session trying to solve a small problem.

In this set of exercises we will explore how to use python to save and load structured data to and from files.

We will begin by looking at direct file access and move on to look at how to work with json file formats. We will introduce parts of the standard python library that can help with this. In particular, pathlib and json.

Table of contents

Basic file IO

The most common way to interact with files in python is using the built-in open() function. The open() function returns a file object. File objects expose methods such as read() and write() which allow us to load and save data to the file system.

When creating a file object, we need to specify the location of the file and either read (‘r’) or write (‘w’) mode. For example, when a file is open for read-only actions, it cannot be used to write data.

The default read and write modes read and write strings to files. There are other modes (e.g. for reading/writing binary data) but we can ignore these for now. They work in a very similar way.

Writing to files

The following code will create a file in the current directory called shopping.txt.

file = open('shopping.txt', 'w')
file.write("apples")
file.write("bananas")
file.write("cherries")
file.close()

file paths are always relative to the location from which the code was executed. This is sometimes the location of the python source code. But code can be executed from another location.

The above code will create the file if it doesn’t exist. It will replace any existing content in the file with the string 'applesbananascherries'.

Reading from files

We can read the data from a file into a programme by opening the file in read mode and calling the read() method.

file = open('shopping.txt', 'r')
print(file.read())
file.close()

Following the example above, the output is as follows:

applesbananascherries

Be careful with this. the read() method will read the entire file by default. If the file is larger than the available memory, you may have a problem.

Optionally, you can pass a number of characters to read

file.read(10)

In our example, this would read 'applesbana'.

If the file does not exist, the code will raise a FileNotFoundError.

Traceback (most recent call last):
  File "read_file.py", line 1, in <module>
    file = open('shopping.txt', 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'shopping.txt'

Context managers

When we are finished working with a file, it is important that we close it. Python has a built in way to do this using a context manager. Context managers allow us to use a with clause to acquire and release resources such as files. We specify a code block to handle the resource.

So we can rewrite the above examples like this:

with open('shopping.txt', 'w') as file:
    file.write("apples")
    file.write("bananas")
    file.write("cherries")

This is familiar syntax, context managers are compound statements.

Notice we no longer need to close the file. The context manager automatically closes it when the code block ends. It will also close the file in the event of an error.

Always (ALWAYS) use with clauses to open files.

We can read our data in a similar way.

with open('shopping.txt', 'r') as file:
    data = file.read()
print(data)

Line-oriented files

It is very common to organise data in files using line breaks (or newline characters) as separators. If we want to add line breaks we can simply add them into our data using the newline character '\n'.

with open('shopping.txt', 'w') as file:
    file.write("apples\n")
    file.write("bananas\n")
    file.write("cherries\n")

Alternatively, we can use the print() method to write to a file by passing a file argument.

with open('shopping.txt', 'w') as file:
    print("apples", file=file)
    print("bananas", file=file)
    print("cherries", file=file)

Notice we are not adding line breaks because the print() function will add the line breaks automatically by default.

The print() function takes both sep and end arguments. The sep argument defines the string to use as a separator between values to be printed, and defaults to a space ' '. The end argument defines a string to add to the end of the printed data, it defaults to '\n'. Passing in alternative values for these can be useful.

print(1, 2, 3, sep=" + ", end=" = ")
print(1 + 2 + 3)
1 + 2 + 3 = 6

Either approach produces a file which is divided into entries that can be easily extracted individually.

apples
bananas
cherries

Write data into some files

We can use the built in dir() method to list all the available attributes on a python object.

  • Create a python script to write the output from dir([]) into a file list.txt.
  • Create a python script to write the output from dir('hello') into a file string.txt.

Make sure you use the context manager approach.

Reading data line-by-line

There are a number of ways to extract the data back out.

For line-delimited data we can use the readline() method to read one line of our file.

with open('shopping.txt', 'r') as file:
    apples = file.readline()
    bananas = file.readline()
    cherries = file.readline()
print(apples)
print(bananas)
print(cherries)

Notice that each line retains its newline character

This is a bit cumbersome. Another option is to use the readlines() method (plural) to read the whole file contents as a list of the lines in the file.

This allows easy access to data via slicing and other sequence operations.

with open('shopping.txt', 'r') as file:
    data = file.readlines()
print(data[1]) # 'bananas\n'

Passing a file object into the list constructor will do much the same thing.

with open('shopping.txt', 'r') as file:
   data = list(file)
print(data[1]) # 'bananas\n'

Alternatively, since the file objects are iterable, it can be useful to iterate directly over them.

with open('shopping.txt', 'r') as file:
    for line in file:
        print(line)

Again, the newline characters are kept.

Using string methods

A simple way to remove the newline characters is to load all the data from the file as a string and then call str.split() with a newline character as the specified separator.

with open('shopping.txt', 'r') as file:
    for line in file.read().split('\n'):
        print(line)

Here, we use .split('\n') to convert our string into an array of substrings. However, there is a problem in that if our file is empty and the file.read() method returns an empty string, then calling split('\n') will give a single item list containing one empty string.

''.split('\n')
['']

This is why the string method splitlines() exists. It is specifically designed to handle line-oriented data from a file. It will produce an empty list in the event of an empty file.

''.splitlines()
[]

We can use it like this.

with open('shopping.txt', 'r') as file:
    for line in file.read().splitlines():
        print(line)

Here the file read() method returns the complete string contents of the files with newline characters intact and the splitlines() method divides the string into a list of lines with the newline delimiters removed.

apples
bananas
cherries

Note that again, this will load the entire file so, for very large files, it may not be an appropriate approach.

Read your files back

Create a python script to read the data from both list.txt and string.txt.

Again, make sure you use the context manager approach.

Look back at the section from last week on the set data type. Expand your programme to print out the method names which:

  • are unique to strings
  • are unique to lists
  • are shared between lists and strings

Making things easier with pathlib

You should usually consider using built-in module pathlib when you need to work with files and folders, simply because it makes things neater and easier.

Because pathlib is relatively new (it was introduced in 2014 with python 3.4), you will still find lots of example code online that use the older approaches above. It can take years for new approaches to gain in popularity Many python users still do not know that pathlib exists. Because of course, the old ways still work.

Path objects

The core of pathlib is the Path class. The Path class allows us to create path instances, objects that represent individual paths. That is, objects that represent the locations of files and folders in the filesystem.

Instances of the Path class are created by passing folder and file names into the Path constructor.

from pathlib import Path

my_path = Path('folder1', 'folder2', 'filename.txt')

Depending on your platform, this will either generate an instance of the PosixPath class (linux/mac) or the WindowsPath class (windows). Both classes have the same set of useful properties and methods which can be used to inspect and manipulate the filesystem.

The path naively represents the folders and the filename we provided as a relative path from the current working directory (i.e. the location from which the script is executed).

print(my_path)
PosixPath('folder1/folder2/filename.txt')

You may get a different result, depending on your platform.

Creating absolute paths

We can also use the Path.absolute() method to return a new Path object with the absolute path rather than the relative path.

print(my_path.absolute())
PosixPath('/home/graeme/Teaching/GAMR1520/folder1/folder2/filename.txt')

Again, your result will look different, but should make sense to you.

Checking if a path exists

Notice that the path points to a file (and folder structure) that does not exist. We can check to see if the file exists using Path.exists().

print(my_path.exists())
False

Creating files and folders

We can create the missing folders and the file using Path.mkdir() and Path.touch().

my_path.mkdir(parents=True)
my_path.touch()
print(my_path.exists)
True

You should see the folders and file have been created.

Both mkdir() and touch will actually fail if the folders do exist. Passing exist_ok=True into both of them will suppress the error. See the Path.mkdir and Path.touch documentation for details.

If our path exists, we can also test to see if we are pointing at a folder or a file.

print(my_path.is_dir())
print(my_path.is_file())
False
True

Directory listings

If we are pointing at a folder, we can list the contents using Path.iterdir(). This will iterate over the contents of the folder and return a Path object for each file or folder it finds.

Consider a path object pointing to the current working directory ('.').

from pathlib import Path

here = Path('.')
for p in here.iterdir():
    if p.is_dir():
        print(str(p).ljust(20), "(folder)", sep="\t")
    if p.is_file():
        print(str(p).ljust(20), "(file)", sep="\t")
shopping.txt        (file)
list.txt            (file)
string.txt          (file)
folder1             (folder)
my_experiments      (folder)
script.py           (file)

Drilling down into subdirectories is perfectly possible.

Reading and writing data

Path objects have access to the Path.open() method which is virtually identical to the builtin open() function except it obviously doesn’t need a filename/path to be provided.

my_path = Path('shopping.txt')
with my_path.open('r') as my_file:
    data = my_file.read().splitlines()

Joining paths

Often we want to programmatically determine the path to a file. We can join paths using the Path.joinpath() method.

root = Path('folder')
my_path = root.joinpath('filename.txt')

Alternatively, we can use the slash operator (/) to join paths. This is equivalent to the above.

root = Path('folder')
my_path = root / 'filename.txt'

Though all of these were possible before pathlib was introduced, using the os module (and in particular os.path), they tended to produce somewhat messy code. The object-oriented interface provided by pathlib makes working with paths simple and intuitive.

from pathlib import Path

root = Path('my_lists')
root.mkdir(exist_ok=True)

shopping = ['apples', 'bananas', 'cherries']

name = input('Enter a name for the list:\n>>> ')
my_file = root / f'{name}.txt'

with my_file.open('w') as f:
    for item in shopping:
        print(item, file=f)

The above code creates a folder called 'my_lists' (if it doesn’t already exist). It then asks the user to provide a name for the list. It then creates a file in the 'my_lists' folder with the given name. The beauty of this is that the code automatically handles the filesystem semantics of different operating systems. This code will work on linux, mac and windows, producing the appropriate directory structure.

Code using the new pathlib module is simpler and more pythonic.

More shopping lists

Study the following simple programme. The code allows a user to construct a shopping list on the command line.

There are three compound statements in the code.

  • An infinite while loop which continues until it is exited using the break statement.
  • An if conditional protecting the break statement which is controlled by user input.
  • A for loop which prints each item of the list on a separate line.
shopping = ()
width = 20
hline = '=' * width
while True:
    print(hline)
    print('shopping list'.center(width))
    print(hline)
    for item in shopping:
        print(item.center(width))
    print(hline)
    keep_going = input("add an item to the list? [y/n]")
    if not keep_going.lower().startswith('y'):
        break
    shopping += (input("New item: "),)

Make sure you understand the three compound statements in the code.

  1. Add code before the while loop that loads the initial shopping data from a file.
  2. Add code after the while loop to save the data back to the same file before it exits.

The code should still allow the user to append new values.

Once you have attempted this, download the solution and compare. Can you improve the solution?

What about editing items or deleting items from the list?

JSON data

Javascript Object Notation is a very simple and very common, human-readable data format for transferring data between systems.

Essentially, the json format maps onto the core python data types, numbers, strings and booleans as well as dictionaries and lists.

Python provides the built-in json module for generating/handling json data. To use json, we first need to import the module.

Start a new file for this

import json

Let’s also create a nested python data structure, anything will do.

Add this to your file

saved_game = {
    "health": 82,
    "attack": 21,
    "shield": 5,
    "magic": 76,
    "speed": 45,
    "inventory": [
        {
            "name": "blue potion", 
            "type": "health", 
            "power": 6
        },
        {
            "name": "massive sword", 
            "type": "weapon", 
            "power": 9
        }
    ],
    "check-point": 84
}

Generating and reading JSON strings

It’s possible to work directly with JSON strings using json.loads() and json.dumps().

The following code uses json.dumps() to convert our dictionary into a json string.

my_json = json.dumps(saved_game)
print(my_json)

You should be extending your file, you will need the import statement and the saved_game assignment to a dictionary literal (as above) in your script for this to work.

The string output contains the same data in json format.

{"health": 82, "attack": 21, "shield": 5, "magic": 76, "speed": 45, "inventory": [{"name": "blue potion", "type": "health", "power": 6}, {"name": "massive sword", "type": "weapon", "power": 9}], "check-point": 84}

You will notice, the output is very similar to the python syntax. The main difference is that the quotes are now all double.

There are some more significant differences, for example, json objects (equivalent to dictionaries) can only have string keys so if we use e.g. numeric keys, they will be converted to strings. If we use more complex objects as keys (e.g. tuples) then the conversion will fail with a TypeError.

For convenience, we can pass formatting parameters such as indent to json.dumps().

my_json = json.dumps(saved_game, indent=2)
print(my_json)

Again, this is missing the import and the dictionary literal. Hopefully its clear what you need to do. You just need to add the indent=2 argument.

This produces a much more readable string with newline characters and spaces for indentation.

{
  "health": 82,
  "attack": 21,
  "shield": 5,
  "magic": 76,
  "speed": 45,
  "inventory": [
    {
      "name": "blue potion",
      "type": "health",
      "power": 6
    },
    {
      "name": "massive sword",
      "type": "weapon",
      "power": 9
    }
  ],
  "check-point": 84
}

If we want to load data from a json string we can use json.loads().

import json

my_json_string = '{"some_key": "some value", "another_key": 1000}'
my_dictionary = json.loads(my_json_string)
print(my_dictionary)
print(type(my_dictionary))
print(type(my_dictionary["another_key"]))

The output shows the string has been converted to a python dictionary and the integer has been properly identified (as it has no quotes in the string).

{'some_key': 'some value', 'another_key': 1000}
<class 'dict'>
<class 'int'>

Writing to a file

We can use the json.dump() method to write the result to a file. The same indent parameter is also an option here.

That’s json.dump(), not json.dumps() (which means dump to a string).

from pathlib import Path
import json

path = Path("saved_game.json")

with path.open('w') as f:
    json.dump(saved_game, f, indent=2)

Reading the json data back out from our file is also very simple.

from pathlib import Path
import json

path = Path("saved_game.json")

with path.open('r') as f:
    my_dict = json.load(f)
print(my_dict)

JSON provides a very convenient way to store complex data and is an excellent data interchange format between programmes, whether they are written in python or any other language.

Comma separated values (CSV)

JSON is very flexible and allows for complex, nested data structures to be easily saved and restored. However, if the data are simpler, or can be transformed into a simple grid of values, then CSV might be a better choice.

Imagine a dataset like this.

animals = [{'name': 'Anteater', 'description': 'Eats ants'},
           {'name': 'Bear', 'description': 'Grizzly'},
           {'name': 'Chimp', 'description': 'Chump'},
           {'name': 'Dog', 'description': 'Friend'}]

We could save it as simple JSON ('animals.json').

from pathlib import Path
import json

animals = [{'name': 'Anteater', 'description': 'Eats ants'},
           {'name': 'Bear', 'description': 'Grizzly'},
           {'name': 'Chimp', 'description': 'Chump'},
           {'name': 'Dog', 'description': 'Friend'}]

path = Path('animals.json')

with path.open('w') as file:
    json.dump(animals, file, indent=2)
[
  {
    "name": "Anteater",
    "description": "Eats ants"
  },
  {
    "name": "Bear",
    "description": "Grizzly"
  },
  {
    "name": "Chimp",
    "description": "Chump"
  },
  {
    "name": "Dog",
    "description": "Friend"
  }
]

Notice that the JSON format wastes a lot of space because of the repetition of keys.

The strings "name" and "description" are repeated for each record.

We could save a lot of space by converting it to comma separated values, e.g. 'animals.csv'.

name,description
Anteater,Eats ants
Bear,Grizzly
Chimp,Chump
Dog,Friend

Now the first row contains the column names and the data are much more compact. The file is still human-readable and also, the CSV formatted data can be opened as a spreadsheet.

The csv module

The csv module takes some getting used to, but is really very simple. To write data we can use a csv.DictWriter object.

It’s also possible to use a simpler csv.Writer object with slightly less convenience.

from pathlib import Path
from csv import DictWriter

animals = [{'name': 'Anteater', 'description': 'Eats ants'},
           {'name': 'Bear', 'description': 'Grizzly'},
           {'name': 'Chimp', 'description': 'Chump'},
           {'name': 'Dog', 'description': 'Friend'}]

path = Path('animals.csv')

with path.open('w') as file:
    writer = DictWriter(file, fieldnames=['name', 'description'])
    writer.writeheader()
    writer.writerows(animals)
name,description
Anteater,Eats ants
Bear,Grizzly
Chimp,Chump
Dog,Friend

Notice we write the CSV header row before we write all the data.

Converting between formats

Say you need to read from a JSON file and write into a CSV file. We can use one with clause to open both files and ensure both are closed.

The conversion is simple, it requires both json.load() and a csv.DictWriter object.

from pathlib import Path 
from csv import DictReader
import json

inpath = Path('animals.json')
outpath = Path('animals_from_json.csv')

with inpath.open('r') as infile, outpath.open('w') as outfile:
    animals = json.load(infile)
    writer = DictWriter(outfile, fieldnames=['name', 'description'])
    writer.writeheader()
    writer.writerows(animals)

The code here is doing a fairly complex job but with some care the code can be made extremely clear.

Converting in the other direction requires a csv.DictReader object and json.dump().

Notice the csv.DictReader object will use the header row to determine the field names. Also, because it’s iterable, we can just pass it into list() to load all the data.

from pathlib import Path 
from csv import DictReader
import json

inpath = Path('animals.csv')
outpath = Path('animals_from_csv.json')

with inpath.open('r') as infile, outpath.open('w') as outfile:
    animals = list(DictReader(infile))
    json.dump(animals, outfile, indent=2)

Again, a seemingly complex task can be specified in eight lines of code.

In a real world scenario it is likely that the data would require processing between input files and output files. The best way to do this is to use functions. We will learn about functions in the next set of exercises.