Introductory Programming in Python: Lesson 18
Files for Input and Output

Filesystems, Directories, and Files

The concept of a file is fairly intuitive, but as with all things programming, intuition is not enough. Let us briefly explore exactly what a file is. Microsoft has served, as always, to confuse and obfuscate the concept of a file. A file is interchangeably called a document, a file, a song, a movie, a spreadsheet, etc. It is important to understand that the many names used in common parlance for a file refer to the use to which that file is put, rather than to what a file actually is. So what is a file? Simply put, a file is an ordered collection of data associated with a name and location by a filesystem on a particular device. The device might be a hard drive, a flash disk, a CD-ROM, or even your cellphone. The same file may be used for many different things, e.g. I can read my .mp3 file as text. It makes very little sense, but it is possible. Alternatively, one could attempt to play the contents of a large spreadsheet through one's sound card. Again, it won't make much sense (in fact it'll just sound like a burst of static), but it is possible.

A file is a name associated with an ordered collection of data on some storage medium

Another concept that the GUIs of the 1990s onwards have eroded significantly is the idea of the filesystem. If we look at a physical storage device directly, we encounter a number of problems dealing with files. Firstly, a single file need not be stored in a single contiguous area; it can be fragmented. Secondly, there's no apparent structure or order to where files are stored relative to one another. Your word processing documents might be stored right next to, or even intermeshed with, files belonging to the operating system, your music collection, or your applications. Clearly we need a way of imposing a logical structure onto a collection of files. So we introduce a method of grouping arbitrary files together, namely the directory, which you may know of as a folder. Generally, any file can be put into any directory, but cannot exist in multiple directories at once. Thus we have given files a location. However, directories can also contain other directories, introducing a hierarchical structure of files within directories, within other directories, within ... oh hell. Where does it end? It does end, or rather it starts somewhere!

A directory is an arbitrary unordered grouping of files

Every filesystem has a starting point called the root. In MS-DOS, Windows, Symbian and some other cellphone OSes, each filesystem is assigned its own root using a letter from the alphabet, as in C:\. Note the backslash! In Linux, Unix and any other POSIX compliant OS, the root is simply called /, and other filesystems can be placed inside the root, much like directories can be placed inside one another. So now we have a way to specify a particular file exactly: even if two files have the same name, no two files can have exactly the same location. The location of a file is always specified from the root down through each directory to the file, including the name of the file, e.g.

C:\Documents and Settings\James\Desktop\todo.txt
/home/james/Desktop/todo.txt

Note how in both cases we start with the root, and name each directory successively, zeroing in on the directory containing the file of interest. We separate the directory names with \ in the case of Windows/MS-DOS, or / in the case of Linux/POSIX. The explicit sequence from root to name is known as the full path of the file, as we have followed the full path from root through each directory to the file.
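
To make the idea of "root, then each directory, then the name" concrete, here is a minimal sketch that takes the two example paths apart. It assumes Python's standard ntpath and posixpath modules, which handle Windows-style and POSIX-style paths respectively on any operating system; the variable names are just for illustration.

```python
import ntpath
import posixpath

windows_path = r"C:\Documents and Settings\James\Desktop\todo.txt"
posix_path = "/home/james/Desktop/todo.txt"

# splitdrive separates the drive letter from the rest of a Windows path.
drive, rest = ntpath.splitdrive(windows_path)

# The first element is the root; the last element is the file's name.
windows_parts = [drive + "\\"] + rest.strip("\\").split("\\")
posix_parts = ["/"] + posix_path.strip("/").split("/")

print(windows_parts)
print(posix_parts)

# basename gives just the name at the end of the path.
windows_name = ntpath.basename(windows_path)
posix_name = posixpath.basename(posix_path)
```

Both paths end in the same name, todo.txt, yet their locations differ, which is exactly why the full path is needed to identify a file.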

Relative Paths and the Working Directory

Of course, specifying the full path of every file every time we wish to use it is inconvenient. Many operating systems thus include the idea of a working directory. Working directories are tied to login sessions, and are not readily apparent in GUIs. When a user logs in to a system, their working directory for that login session is usually set to their home directory. Various OS commands can change or print out the working directory; cd, for example, changes the working directory to another one. Whenever a file or directory is specified, and the specification is not a full path, the file is considered relative to the current working directory. For example, specifying only a name for a file implies that the file we are looking for is in the working directory. When a program runs, it inherits the working directory from the login session from which it is run. In the example below, the full path to the working directory is displayed in the prompt.

/home/james $ ls
todo.txt
/home/james $ cd /home
/home $ ls
james
/home $ cat james/todo.txt
I have nothing to do!
/home $

Note how the final command specifies an incomplete path 'james/todo.txt', and not the full path. Because a full path was not specified, the working directory ('/home' at the time) is prepended to the name, yielding '/home/james/todo.txt'. Thus we are able to specify files and directories lower down in the directory hierarchy in a relative manner.
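
A Python program can inspect and change its own working directory in much the same way. The following is a rough sketch, assuming the standard os module's getcwd, chdir and os.path.abspath functions. A temporary directory stands in for a home directory here (it is not a real /home/james), so nothing outside it is touched.

```python
import os
import tempfile

# A temporary directory stands in for a home directory (illustrative only).
home = os.path.realpath(tempfile.mkdtemp())
todo_path = os.path.join(home, "todo.txt")

out = open(todo_path, "w")
out.write("I have nothing to do!\n")
out.close()

original = os.getcwd()        # remember where we started
os.chdir(home)                # the equivalent of 'cd' at the prompt
try:
    working_directory = os.getcwd()
    # A bare name is looked up relative to the working directory.
    inp = open("todo.txt", "r")
    contents = inp.read()
    inp.close()
    # abspath prepends the working directory to a relative path.
    resolved = os.path.abspath("todo.txt")
finally:
    os.chdir(original)        # put the working directory back
```

Restoring the original working directory afterwards is a courtesy to the rest of the program, since every relative path it uses depends on it.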

Of course if a file is in a directory somewhere above or rather outside the working directory, it would seem we must still use the full path to specify it, but there are a few shortcuts we can take.

./ indicates the working directory, not much help to us really...
../ indicates the directory above the directory specified in the path so far.

For example, if the working directory were '/home/james/IntroPython'

index.html would have the full path /home/james/IntroPython/index.html
./index.html would have the full path /home/james/IntroPython/index.html
data/testinput.txt would have the full path /home/james/IntroPython/data/testinput.txt
../todo.txt would have the full path /home/james/todo.txt
../amusement/phdcomics.tar.gz would have the full path /home/james/amusement/phdcomics.tar.gz
../../../bin/ls would have the full path /bin/ls
/home/james/../../bin/ls would have the full path /bin/ls
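
The table above can be checked mechanically. This is a small sketch, assuming the standard posixpath module (the POSIX flavour of os.path); full_path is a hypothetical helper name, not something built in. Note that it works purely on the text of the path and never looks at the disk.

```python
import posixpath

def full_path(working_directory, path):
    """Resolve a POSIX-style path against a working directory."""
    # join() leaves 'path' alone if it is already a full path;
    # normpath() then collapses the './' and '../' components.
    return posixpath.normpath(posixpath.join(working_directory, path))

cwd = "/home/james/IntroPython"
for path in ["index.html", "./index.html", "data/testinput.txt",
             "../todo.txt", "../../../bin/ls", "/home/james/../../bin/ls"]:
    print(path + " -> " + full_path(cwd, path))
```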

Files for Input and Output

And all this has been leading up to the idea that programs need not limit themselves to keystrokes and the screen as input and output methods respectively. Files can be used as both input and output. Files allow us to conveniently input large quantities of data, and similarly output large quantities of data for later review. When we want to work with files, we must clearly be able to specify the file we wish to work with, using its location and name, i.e. a valid full or relative path.

Opening Files

Opening a file in Python is simple. We use the open function. The open function returns a file object, given a path to a file in the form of a string, and a string specifying whether to open the file for reading or writing.

By default, Python treats files as text files, meaning they are broken up into units called lines. However, there is also a concept called a file pointer. A file pointer is like a cursor in a word processor, which sits between characters in a file. When we read from a file, we read from the file pointer onwards to the right. Similarly, all writing done to the file will be from the file pointer onwards, overwriting if the file already contained data to the right of the file pointer, otherwise extending the size of the file as necessary. Both reading from and writing to a file reposition the file pointer at the end of the sequence read or written.

Python 2.4.3 (#1, Oct	2 2006, 21:50:13) 
[GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open("input.txt","r")
>>> f
<open file 'input.txt', mode 'r' at 0xb7dacentered>
>>>

Here we see the open function in action. It is most commonly used as the expression of an assignment statement, but obviously, being a function, can be used anywhere an expression is valid. The open function takes two parameters: firstly, the path of the file to open, and secondly, a string specifying the mode (read, write, or append) in which to open the file. Valid strings for the mode are:

"r" opens the file in read only mode. The file must exist prior to opening. A plus suffix ("r+") means the file is opened for both reading and writing. In either case the file is opened with the file pointer at the beginning of the file.
"w" opens the file in write only mode. The file will be overwritten if it already exists, otherwise it will be created. A plus suffix ("w+") means the file is opened for both reading and writing, but the file is still truncated to zero bytes if it exists already. Obviously this means the file pointer starts at the beginning of the file. Writing beyond the end of a file simply enlarges the file to accommodate whatever is written.
"a" opens the file in append mode. The file can only be written to, and the file pointer starts at the end of the file, meaning all data subsequently written will be added to the end of the file. If the file doesn't already exist, it will be created. A plus suffix ("a+") means the file is opened for both reading and appending. Where the file pointer initially sits for reading depends on the platform and Python version, but on most systems written data is always added to the end of the file, wherever the file pointer has been moved. Unlike "r+", "a+" also creates the file if it doesn't exist, so the two are not equivalent.

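The difference between the modes is easiest to see by trying them in turn on one file. Below is a minimal sketch, assuming a scratch file with the made-up name modes.txt in a temporary directory. It uses the write, read and close methods of file objects.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

f = open(path, "w")       # creates the file (or empties an existing one)
f.write("first\n")
f.close()

f = open(path, "a")       # writes are added to the end
f.write("second\n")
f.close()

f = open(path, "r")       # read only, file pointer at the beginning
after_append = f.read()
f.close()

f = open(path, "r+")      # read and write, existing data is kept
f.write("FIRST")          # overwrites the first five characters
f.close()

f = open(path, "r")
after_overwrite = f.read()
f.close()

f = open(path, "w")       # reopening with "w" discards the old contents
f.close()
size_after_w = os.path.getsize(path)
```

Note in particular that merely opening an existing file with "w" is enough to empty it, even if nothing is ever written.
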
Reading From Files

Once we have an open file object, we can use its methods to both read from and write to the file it represents. When a file is opened for reading, the file pointer is positioned at the beginning of the file (position 0) which is just before the first character in the file. File objects provide a variety of methods to read from files...

<file object>.read(<count>) reads 'count' characters from the file object starting at the file pointer, and returns the read characters as a string. If fewer than 'count' characters remain to be read, then a string is still returned, but it will be shorter than 'count' characters. An empty string is returned if the file pointer is at the end of the file.
<file object>.readline() reads from the file pointer onwards up to and including the first newline ('\n') character, and returns the read characters as a string. An empty string is returned if the file pointer is at the end of the file.
<file object>.readlines() reads from the file pointer until the end of file, returning a list of lines, each containing the trailing newline characters, as strings. An empty list is returned if the file pointer is already at the end of the file.

Using the following file as an example

This is a simple file
Containing only three lines
of text

We can demonstrate the use of the various file reading methods.

>>> f.read(3)
'Thi'
>>> f.readline()
's is a simple file\n'
>>> f.readlines()
['Containing only three lines\n', 'of text\n']
>>>

Note how using the simple 'read' method, we get only three characters (being the count we specified), and the 'readline' method continues from where 'read' finished. This illustrates the file pointer in action. The file pointer is positioned between the 'i' and 's' of 'This' on the first line of the file after the 'read' method is executed. After the 'readline' it is between the end of line one and the first character of line two ('C'). Thus, 'readlines' has two complete lines left to read when it is called.

Often a more useful way to read the lines of a file in sequence is to use a for loop over the file. When used in a for loop, a file object acts as a sequence of lines, as in

>>> f = open("input.txt","r")
>>> for line in f:
...     print line.rstrip()
... 
This is a simple file
Containing only three lines
of text
>>>

Writing To Files

Writing files uses methods very similar to those used to read from files, except that writing is often buffered in memory, meaning the file on disk is only actually updated when the buffer fills up, the buffer is explicitly flushed, or the file object is closed.

<file object>.write(<string>) writes the contents of 'string' to the file, starting at the file pointer and overwriting data in the file, or enlarging the file as necessary. Note that no newline is added, so multiple successive write calls without any newlines in their respective strings, produce only one line of text in the file.
<file object>.writelines(<list>) writes the elements of the list, which must all be strings, in order to the file. Newlines are not added, so if the strings have no newlines, only one line of output will be written.

As an example, let's create a new file, and write some text out to it.

>>> w = open("newfile.txt","w")
>>> w.write("This is the first line of text ")
>>> w.write("This is still on the first line\n")
>>> w.writelines(["The second line\n", "The Salmon Mousse"])
>>> w.close()
>>>

Looking at 'newfile.txt', we see

This is the first line of text This is still on the first line
The second line
The Salmon Mousse

We see from the first two calls to 'write' that we have to supply our own newline characters to force line breaks in the output. We also see that closing our files is a good idea when they are opened for writing. Technically, Python's garbage collector usually takes care of this for us, closing file objects before they are collected, but it is considered good practice to explicitly close files.

<file object>.close() closes the file, disallowing further read or write operations on the file. Any pending written data in the file buffer is flushed to disk. Technically, close is called automatically when a file object variable is garbage collected, but it is good practice to explicitly close the files your program opens, as the operating system has a limit to the number of files that may be open at one time.
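
One common way to make sure close is always called, even if something goes wrong halfway through writing, is a try/finally block. This is only a sketch of the pattern. The file name log.txt and the variable names are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log.txt")  # hypothetical file name

log = open(path, "w")
try:
    log.write("entry one\n")
    log.write("entry two\n")
finally:
    # Runs whether or not the writes raised an exception.
    log.close()

was_closed = log.closed      # True once close() has been called
```

Later versions of Python also offer a with statement that closes the file in the same way, which you may prefer if your Python version supports it.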

Moving the File Pointer Manually

Occasionally we may want to move the file pointer manually, for example when reading a file of a format that allows us to skip or ignore large sections to get to the part of the file we want. Python provides two methods of file objects to do this, the first to tell us where the pointer is currently, and the second to move it. Both treat the file as a one dimensional stream of characters, much like a string. The file pointer is an integer index into this 'string' specifying the first character that would be read or written in the next operation.

<file object>.tell() returns the index of the file pointer for the file.
<file object>.seek(<position>[, <whence>]) moves the file pointer to position, measured from a reference point indicated by whence. Whence can take one of three values, defaulting to 0:
0. Set position relative to the beginning of the file.
1. Set position relative to the current position, i.e. negative positions will be before the current position, and positive positions will be after it.
2. Set position relative to the end of the file, i.e. only negative positions will give us a position inside the file.
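
Here is a short sketch of tell and seek with each value of whence, using a copy of the three line example file from earlier. One assumption to flag: the file is opened in binary mode ("rb"), because recent versions of Python only allow seeking relative to the current position or the end in binary mode. Positions are therefore byte offsets, and what is read comes back as bytes.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "input.txt")  # hypothetical copy
out = open(path, "wb")
out.write(b"This is a simple file\nContaining only three lines\nof text\n")
out.close()

f = open(path, "rb")          # binary mode: positions are byte offsets
f.seek(5)                     # whence defaults to 0: from the beginning
word = f.read(2)              # the two bytes at positions 5 and 6
position = f.tell()           # just after what we read

f.seek(-2, 1)                 # whence 1: two bytes back from here
again = f.read(2)             # the same two bytes once more

f.seek(-5, 2)                 # whence 2: five bytes before the end
tail = f.read()               # everything from there to the end
f.close()
```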

Truncating a File

Now that we can move the file pointer manually, we can do some funky things. We could for example seek to somewhere in the middle of a file, and overwrite the data there, without affecting the rest of the file. We might even want to remove information from a file, where we seek to the start of the information, and overwrite it with... well, spaces. Oh dear, we have a problem here! We can't actually remove information from a file, or can we? The truth is that we can only shorten the length of a file, more formally known as truncating the file. What this means is that to delete the information in the file from some position (x) up to some position (y), we must read everything from y to the end of the file, then seek back to x, and write out what we read. Finally we truncate the file at the new position of the file pointer, discarding the leftover data at the end.

<file object>.truncate([size]) truncates the file to be a maximum of size bytes in length. If size is not given, the file is truncated at the current file pointer position.
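
The delete recipe described above (read from y to the end, seek back to x, write, truncate) might be sketched as follows. delete_range is a hypothetical helper name, and the file is opened in binary read/write mode ("r+b") so that positions are plain byte offsets. Because it reads the whole tail of the file into memory, this is only suitable for files of modest size.

```python
import os
import tempfile

def delete_range(path, x, y):
    """Remove the bytes from position x up to (not including) y."""
    f = open(path, "r+b")
    try:
        f.seek(y)
        rest = f.read()       # everything from y to the end of the file
        f.seek(x)
        f.write(rest)         # overwrite, starting at x
        f.truncate()          # cut the file off at the file pointer
    finally:
        f.close()

path = os.path.join(tempfile.mkdtemp(), "data.txt")  # hypothetical file
out = open(path, "wb")
out.write(b"keep this, DELETE THIS, and keep this too\n")
out.close()

delete_range(path, 11, 24)    # positions 11 to 23 hold "DELETE THIS, "

f = open(path, "rb")
result = f.read()
f.close()
```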