GAMR1520: Markup languages and scripting

Working with files

Dr Graeme Stuart


Storing data in files provides a simple persistence mechanism

Data can be stored in simple text files, e.g. shopping.txt.

apples
bananas
cherries

We can use comma-separated files, e.g. shopping.csv, for more structure.

name, quantity
apples, 5
bananas, 6
cherries, 25

JSON format is even more structured. Here’s an example shopping.json.

[{"name": "apples", "quantity": 5},
 {"name": "bananas", "quantity": 6},
 {"name": "cherries", "quantity": 25}]

Accessing files with open() and close()

Write mode, 'w'

file = open('shopping.txt', 'w')
file.write("apples")
file.write("bananas")
file.write("cherries")
file.close()

Read mode, 'r'

file = open('shopping.txt', 'r')
print(file.read())
file.close()
applesbananascherries
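
There is also an append mode, 'a', which adds to the end of an existing file rather than replacing its contents. A short sketch:

```python
# Append mode, 'a', adds to the end of the file
# rather than overwriting its contents.
file = open('shopping.txt', 'a')
file.write("dates")
file.close()
```

If 'shopping.txt' does not exist yet, append mode will create it, just like write mode.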

Using with, a context manager

A context manager is an object that defines the context of a with statement. The built-in function open() is the most commonly used context manager in Python. It manages both opening and closing a file, so the code block runs with an open file handle and the file is closed when the block ends.

with open('shopping.txt', 'w') as file:
    file.write("apples")
    file.write("bananas")
    file.write("cherries")

This pattern closes the file automatically, even if an error occurs inside the block, so it should be preferred whenever you access files.

with open('shopping.txt', 'r') as file:
    print(file.read())
applesbananascherries
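
Under the hood, the with statement behaves much like this try/finally pattern (a sketch of the idea, not the exact implementation):

```python
# Roughly equivalent to the with statement above:
file = open('shopping.txt', 'w')
try:
    file.write("apples")
finally:
    file.close()  # always runs, even if write() raises an error
```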

What is this file object?

The open() function returns different object types, depending on the mode.

for mode in ['r', 'w', 'rb', 'wb']:
    with open('shopping.txt', mode) as f:
        print(f"{mode}: {f}")
r: <_io.TextIOWrapper name='shopping.txt' mode='r' encoding='UTF-8'>
w: <_io.TextIOWrapper name='shopping.txt' mode='w' encoding='UTF-8'>
rb: <_io.BufferedReader name='shopping.txt'>
wb: <_io.BufferedWriter name='shopping.txt'>

Notice that text file objects are the default and that they default to UTF-8 encoding. In most cases the default read ('r') and write ('w') modes are all you will ever need.

However, if you are working with binary file formats, the binary modes ('rb' and 'wb') are available.
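
As a quick sketch, binary modes exchange bytes objects rather than strings (the filename shopping.bin is just for illustration):

```python
# Binary modes read and write bytes objects, not str.
with open('shopping.bin', 'wb') as f:
    f.write(b'apples')

with open('shopping.bin', 'rb') as f:
    data = f.read()

print(type(data))  # <class 'bytes'>
```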

Markup languages such as HTML are all about human-readability


Creating line-oriented files

Notice that the result we got when we wrote data into our 'shopping.txt' file was all on one line.

applesbananascherries

We can add literal newline characters into our files to write line-oriented data.

shopping = ['apples', 'bananas', 'cherries']
with open('shopping.txt', 'w') as file:
    for item in shopping:
        file.write(f"{item}\n")

This produces a result which is clearly easier to parse.

apples
bananas
cherries

Using print() with files

The print() function takes an optional file argument.

shopping = ['apples', 'bananas', 'cherries']
with open('shopping.txt', 'w') as file:
    for item in shopping:
        print(item, file=file)

This produces an equivalent result.

apples
bananas
cherries

This is because print() takes an end argument which defaults to a newline character, \n.
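
Passing end="" suppresses that trailing newline, reproducing the original write() behaviour:

```python
shopping = ['apples', 'bananas', 'cherries']
with open('shopping.txt', 'w') as file:
    for item in shopping:
        print(item, end="", file=file)  # no newline added

with open('shopping.txt', 'r') as file:
    print(file.read())  # applesbananascherries
```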


Parsing line-oriented data

With a line-oriented file, we can extract lines, one at a time using the readline() method. Each call will read data from the file up to and including the next '\n' character. If we print the output, then we end up adding an extra newline character.

with open('shopping.txt', 'r') as file:
    apples = file.readline()
print(apples)
apples
(notice the extra blank line)

Calling readlines() will generate a list of all the line-oriented data within a file, with the '\n' characters intact.

with open('shopping.txt', 'r') as file:
    data = file.readlines()
print(data)
['apples\n', 'bananas\n', 'cherries\n']

Files are iterable

Files are iterable, so we can convert them to a list by passing a file object into the list constructor.

with open('shopping.txt', 'r') as file:
    shopping = list(file)
print(shopping)
['apples\n', 'bananas\n', 'cherries\n']

Looping over the file directly is also possible, and keeps the memory footprint lower because only one line is held in memory at a time.

with open('shopping.txt', 'r') as file:
    for line in file:
        print(line, end="")
apples
bananas
cherries

Again, in both cases the newline characters are kept.
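
If the trailing newlines are unwanted, one common approach is to strip() each line as we iterate. A sketch:

```python
# str.strip() removes leading and trailing whitespace,
# including the '\n' at the end of each line.
with open('shopping.txt', 'w') as file:
    file.write("apples\nbananas\ncherries\n")

with open('shopping.txt', 'r') as file:
    shopping = [line.strip() for line in file]
print(shopping)  # ['apples', 'bananas', 'cherries']
```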


Parsing data in other ways

To remove the newline characters we can call str.split('\n') on the entire file contents.

with open('shopping.txt', 'r') as file:
    shopping = file.read().split('\n')
print(shopping)
['apples', 'bananas', 'cherries', '']

However, because the file ends with a newline, this leaves a trailing empty string in the list (and an empty file would give a single-item list containing one empty string). This is why the string method splitlines() exists.

with open('shopping.txt', 'r') as file:
    shopping = file.read().splitlines()
print(shopping)
['apples', 'bananas', 'cherries']
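
The difference is easy to see on a string with a trailing newline:

```python
data = "apples\nbananas\ncherries\n"
print(data.split('\n'))    # ['apples', 'bananas', 'cherries', '']
print(data.splitlines())   # ['apples', 'bananas', 'cherries']
```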

Using pathlib

Many complex issues arise when managing files and folders for a cross-platform application. An excellent tool in the Python standard library is the pathlib module. The core pathlib tool is the Path class, which provides an object-oriented interface representing locations in the filesystem.

from pathlib import Path

my_path = Path('folder1', 'folder2', 'filename.txt')
print(repr(my_path))
print(repr(my_path.absolute()))
PosixPath('folder1/folder2/filename.txt')
PosixPath('/home/graeme/Teaching/GAMR1520/folder1/folder2/filename.txt')

Notice we are using repr() above to show the full representation of the path objects, including their class, rather than just the path string.

Note that the result will be platform dependent and the file may or may not exist at this point.
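
Path objects also expose useful attributes for taking paths apart, for example:

```python
from pathlib import Path

my_path = Path('folder1', 'folder2', 'filename.txt')
print(my_path.name)    # filename.txt
print(my_path.stem)    # filename
print(my_path.suffix)  # .txt
print(my_path.parent)  # folder1/folder2 (separator is platform dependent)
```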


mkdir() and touch()

We can see that the file doesn’t exist.

my_path = Path('folder1', 'folder2', 'filename.txt')
print(my_path.exists())
False

We can create the containing folders and the file.

my_path.parent.mkdir(parents=True, exist_ok=True)
my_path.touch()
print(my_path.exists())
True
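
For simple cases, Path also provides write_text() and read_text(), which open, access and close the file in a single call (the filename note.txt here is just for illustration):

```python
from pathlib import Path

p = Path('note.txt')
p.write_text('hello')   # open, write and close in one call
print(p.read_text())    # hello
```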

iterdir(), is_dir() and is_file()

We can iterate over the file system.

from pathlib import Path

here = Path('.')
for p in here.iterdir():
    if p.is_dir():
        print(str(p).ljust(20), "(folder)", sep="\t")
    if p.is_file():
        print(str(p).ljust(20), "(file)", sep="\t")
shopping.txt        (file)
list.txt            (file)
string.txt          (file)
folder1             (folder)
my_experiments      (folder)
script.py           (file)
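
Path.glob() filters the filesystem with shell-style wildcard patterns. A sketch (example.txt is created here just so the pattern has something to match):

```python
from pathlib import Path

Path('example.txt').touch()  # ensure at least one .txt file exists
for p in Path('.').glob('*.txt'):
    print(p)
```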

Joining paths

The Path.joinpath() method can be used to create new Path objects.

root = Path('folder')
my_path = root.joinpath('filename.txt')

Alternatively, we can use the slash operator (/) to join paths. This is equivalent to the above.

root = Path('folder')
my_path = root / 'filename.txt'

As we shall see, this makes manipulating files and generating directory structures very simple.


Path.open()

Conveniently, the Path object includes an open method which calls the built-in open function for us. So we can create and populate multiple files with a simplified loop.

root = Path('folder')
root.mkdir()

all_lists = {
    'fruit': ['apples', 'bananas', 'cherries'],
    'colours': ['aquamarine', 'blue', 'cyan'],
    'animals': ['armadillo', 'baboon', 'cat']
}

for title, my_list in all_lists.items():
    my_path = root / f'{title}.txt'
    with my_path.open('w') as my_file:
        for item in my_list:
            print(item, file=my_file)

What if I have more complex data?

animals.py

animals = [{'name': 'Anteater', 'description': 'Eats ants'},
           {'name': 'Bear', 'description': 'Grizzly'},
           {'name': 'Chimp', 'description': 'Chump'},
           {'name': 'Dog', 'description': 'Friend'}]

animals.csv

name,description
Anteater,Eats ants
Bear,Grizzly
Chimp,Chump
Dog,Friend

animals.json

[{"name": "Anteater", "description": "Eats ants"}, {"name": "Bear", "description": "Grizzly"}, {"name": "Chimp", "description": "Chump"}, {"name": "Dog", "description": "Friend"}]

Write as CSV

The csv module takes some getting used to, but is really very simple.

from pathlib import Path
from csv import DictWriter

animals = [{'name': 'Anteater', 'description': 'Eats ants'},
           {'name': 'Bear', 'description': 'Grizzly'},
           {'name': 'Chimp', 'description': 'Chump'},
           {'name': 'Dog', 'description': 'Friend'}]

path = Path('animals.csv')

with path.open('w', newline='') as file:
    writer = DictWriter(file, fieldnames=['name', 'description'])
    writer.writeheader()
    writer.writerows(animals)
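
As a quick round-trip check (a sketch), csv.DictReader parses the file straight back into dictionaries:

```python
from csv import DictReader, DictWriter
from pathlib import Path

animals = [{'name': 'Anteater', 'description': 'Eats ants'}]
path = Path('animals.csv')

# write a small file, then read it back
with path.open('w', newline='') as file:
    writer = DictWriter(file, fieldnames=['name', 'description'])
    writer.writeheader()
    writer.writerows(animals)

with path.open('r', newline='') as file:
    result = list(DictReader(file))
print(result)  # [{'name': 'Anteater', 'description': 'Eats ants'}]
```

The newline='' argument is recommended by the csv module documentation so the writer controls line endings itself.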

Write as JSON

from pathlib import Path
import json

animals = [{'name': 'Anteater', 'description': 'Eats ants'},
           {'name': 'Bear', 'description': 'Grizzly'},
           {'name': 'Chimp', 'description': 'Chump'},
           {'name': 'Dog', 'description': 'Friend'}]

path = Path('animals.json')

with path.open('w') as file:
    json.dump(animals, file, indent=2)
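
json.load() reads the data straight back. A sketch of the round trip:

```python
import json
from pathlib import Path

animals = [{'name': 'Bear', 'description': 'Grizzly'}]
path = Path('animals.json')

with path.open('w') as file:
    json.dump(animals, file)

with path.open('r') as file:
    result = json.load(file)
print(result == animals)  # True
```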

JSON to CSV

Say you need to read from a JSON file and write into a CSV file. This requires json.load() and a csv.DictWriter object. One with clause can open both files.

from pathlib import Path 
from csv import DictWriter
import json

inpath = Path('animals.json')
outpath = Path('animals_from_json.csv')

with inpath.open('r') as infile, outpath.open('w', newline='') as outfile:
    animals = json.load(infile)
    writer = DictWriter(outfile, fieldnames=['name', 'description'])
    writer.writeheader()
    writer.writerows(animals)

We need to write the CSV header row before we write all the data.


CSV to JSON

Converting in the other direction requires a csv.DictReader object and json.dump().

Notice the csv.DictReader object will use the header row to determine the field names. Also, because it’s iterable, we can just pass it into list() to load all the data.

from pathlib import Path 
from csv import DictReader
import json

inpath = Path('animals.csv')
outpath = Path('animals_from_csv.json')

with inpath.open('r', newline='') as infile, outpath.open('w') as outfile:
    animals = list(DictReader(infile))
    json.dump(animals, outfile, indent=2)

Thanks for listening

Any questions?

We should have plenty of time for questions and answers.

Just ask, you are probably not the only one who wants to know.

Dr Graeme Stuart