The concept of a file is fairly intuitive, but as with all things programming, intuition is not enough. Let us briefly explore exactly what a file is. Microsoft has served, as always, to confuse and obfuscate the concept of a file. A file is interchangeably called a document, a file, a song, a movie, a spreadsheet, etc. It is important to understand that the multiple names used in common parlance for a file refer to the use to which that file is put, rather than to what a file actually is.
So what is a file? Simply put, a file is an ordered collection of data associated with a name and location by a filesystem on a particular device. The device might be a hard drive, a flash disk, a CD-ROM, or even your cellphone. The same file may be used for many different things, e.g. I can read my .mp3 file as text. It makes very little sense, but it is possible. Alternatively, one could attempt to play the contents of a large spreadsheet through one's sound card. Again, it won't make much sense (in fact it'll just sound like a burst of static), but it is possible.
Another concept that the GUIs of the 1990s onwards have eroded significantly is the idea of the filesystem. If we look at a physical storage device directly, we encounter a number of problems in dealing with files. Firstly, a single file need not be stored in a single contiguous area; it can be fragmented. Secondly, there's no apparent structure or order to where files are stored relative to one another. Your word processing documents might be stored right next to, or even intermeshed with, files belonging to the operating system, your music collection, or your applications. Clearly we need a way of imposing a logical structure onto a collection of files. So we introduce a method of grouping arbitrary files together, namely the directory, which you may know of as a folder. Generally, any file can be put into any directory, but cannot exist in multiple directories at once. Thus we have given files a location. However, directories can also contain other directories, introducing a hierarchical structure of files within directories, within other directories, within ... oh hell. Where does it end? It does end, or rather it starts somewhere!
Every filesystem has a starting point called the root. In MS-DOS,
Windows, Symbian and some other cellphone OSes, each filesystem is
assigned its own root using a letter from the alphabet, as in C:\.
Note the backslash! In Linux, Unix and any other POSIX-compliant OS,
the root is simply called /, and other filesystems can be placed inside
the root, much like directories can be placed inside one another. So
now we have a way to specify a particular file exactly: even if two
files have the same name, no two files can have exactly the same
location. The location of a file is always specified from the root down
through each directory to the file, including the name of the file,
e.g.
    C:\Documents and Settings\James\Desktop\todo.txt
    /home/james/Desktop/todo.txt
Note how in both cases we start with the root, and name each
directory successively, zeroing in on the directory containing the file
of interest. We separate the directory names with \ in the case of
Windows/MS-DOS, or / in the case of Linux/POSIX. The explicit sequence
from root to name is known as the full path of the file, as we have
followed the full path from root through each directory to the file.
Of course specifying the full path of every file every time we wish
to use it is inconvenient. Many operating systems thus include the idea
of a working directory. Working directories are tied to login sessions,
and are not readily apparent in GUIs. When a user logs in to a system,
their working directory for that login session is usually set to their
home directory. Various OS commands can change or print out the working
directory. cd is used to change the working directory to another one.
Whenever a file or directory is specified, and the specification is not
a full path, the file is considered relative to the current working
directory. For example, specifying only a name for a file implies that
the file we are looking for is in the working directory. When a program
runs, it inherits the working directory from the login session from
which it is run. In the example below, the full path to the working
directory is displayed in the prompt.
    /home/james $ ls
    todo.txt
    /home/james $ cd /home
    /home $ ls
    james
    /home $ cat james/todo.txt
    I have nothing to do!
    /home $
Note how the final command specifies an incomplete path 'james/todo.txt', and not the full path. Because a full path was not specified, the working directory ('/home' at the time) is prepended to the name, yielding '/home/james/todo.txt'. Thus we are able to specify files and directories lower down in the directory hierarchy in a relative manner.
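The same behaviour can be observed from inside a program. The following is only a sketch: it assumes a current Python 3 interpreter (where print is a function), builds a throwaway directory with the standard tempfile module instead of touching /home, and uses os.chdir as the program's equivalent of cd. The directory and file names are made up for the illustration.

```python
import os
import tempfile

# Build a small throwaway tree: <base>/james/todo.txt (names are illustrative)
base = os.path.realpath(tempfile.mkdtemp())
os.mkdir(os.path.join(base, "james"))
f = open(os.path.join(base, "james", "todo.txt"), "w")
f.write("I have nothing to do!\n")
f.close()

original_cwd = os.getcwd()    # remember where the program started
os.chdir(base)                # the program's equivalent of 'cd'

# A relative path is resolved against the current working directory.
relative = os.path.join("james", "todo.txt")
resolved = os.path.abspath(relative)
f = open(relative, "r")
contents = f.read()
f.close()

os.chdir(original_cwd)        # put the working directory back

print(resolved)
print(contents)
```

Restoring the original working directory at the end is a courtesy to the rest of the program, since the working directory is shared by everything running in the same process.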
Of course if a file is in a directory somewhere above or rather outside the working directory, it would seem we must still use the full path to specify it, but there are a few shortcuts we can take.
./ indicates the working directory, not much help to us really...
../ indicates the directory above the directory specified in the path so far.
For example, if the working directory were '/home/james/IntroPython':
    index.html would have the full path /home/james/IntroPython/index.html
    ./index.html would have the full path /home/james/IntroPython/index.html
    data/testinput.txt would have the full path /home/james/IntroPython/data/testinput.txt
    ../todo.txt would have the full path /home/james/todo.txt
    ../amusement/phdcomics.tar.gz would have the full path /home/james/amusement/phdcomics.tar.gz
    ../../../bin/ls would have the full path /bin/ls
    /home/james/../../bin/ls would have the full path /bin/ls
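These resolutions can be checked mechanically. The sketch below assumes a current Python 3 interpreter and uses the standard posixpath module so that it applies the Linux/POSIX rules on any platform; full_path is a hypothetical helper name, not something Python provides. Note that normpath works purely on the text of the path and does not look at the disk, so symbolic links are not taken into account.

```python
import posixpath

def full_path(working_directory, path):
    """Resolve 'path' against 'working_directory' using POSIX rules.

    Hypothetical helper for illustration only.
    """
    # join() ignores working_directory when 'path' already starts at the root
    return posixpath.normpath(posixpath.join(working_directory, path))

cwd = "/home/james/IntroPython"
examples = [
    "index.html",
    "./index.html",
    "data/testinput.txt",
    "../todo.txt",
    "../amusement/phdcomics.tar.gz",
    "../../../bin/ls",
    "/home/james/../../bin/ls",
]
for p in examples:
    print(p + " -> " + full_path(cwd, p))
```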
And all this has been leading up to the idea that programs need not limit themselves to keystrokes and the screen as input and output methods respectively. Files can be used as both input and output. Files allow us to conveniently input large quantities of data, and similarly output large quantities of data for later review. When we want to work with files, we must clearly be able to specify the file we wish to work with, using its location and name, i.e. a valid full or relative path.
Opening a file in Python is simple. We use the open function. Given a path to a file in the form of a string, and a string specifying whether to open the file for reading or writing, the open function returns a file object.
By default, Python treats files as text files, meaning they are broken up into units called lines. However, there is also a concept called a file pointer. A file pointer is like a cursor in a word processor, which sits between characters in a file. When we read from a file, we read from the file pointer onwards to the right. Similarly, all writing done to the file will be from the file pointer onwards, overwriting if the file already contained data to the right of the file pointer, otherwise extending the size of the file as necessary. Both reading and writing reposition the file pointer at the end of the sequence read or written.
    Python 2.4.3 (#1, Oct 2 2006, 21:50:13)
    [GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> f = open("input.txt","r")
    >>> f
    <open file 'input.txt', mode 'r' at 0xb7dacentered>
    >>>
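Here we see the open function in action. It is most commonly used as the expression of an assignment statement, but obviously, being a function, it can be used anywhere an expression is valid. The open function takes two parameters: firstly, the path of the file to open, and secondly, a string specifying the mode (read, write, or append) in which to open the file. Valid strings for the mode are 'r' (read; the file must already exist), 'w' (write; the file is created if necessary, and any existing contents are discarded) and 'a' (append; existing contents are kept and writing starts at the end of the file).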
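To see how the read, write and append modes differ, here is a small sketch. It assumes a current Python 3 interpreter, and it writes to a scratch file created with the standard tempfile module; the file name modes.txt is made up for the illustration.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

f = open(path, "w")       # "w": create the file, or empty it if it exists
f.write("first\n")
f.close()

f = open(path, "a")       # "a": keep the existing contents, write at the end
f.write("second\n")
f.close()

f = open(path, "r")       # "r": read only; the file must already exist
after_append = f.read()
f.close()

f = open(path, "w")       # reopening with "w" discards the old contents
f.close()

f = open(path, "r")
after_rewrite = f.read()
f.close()

print(repr(after_append))
print(repr(after_rewrite))
```

The last two steps are the important ones: merely opening an existing file with "w" is enough to empty it, even if nothing is written afterwards.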
Once we have an open file object, we can use its methods to both read from and write to the file it represents. When a file is opened for reading, the file pointer is positioned at the beginning of the file (position 0) which is just before the first character in the file. File objects provide a variety of methods to read from files...
<file object>.read(<count>)
    reads 'count' characters from the file object starting at the file pointer, and returns the read characters as a string. If less than 'count' characters remain to be read, then a string is still returned, but it will be shorter than count characters. An empty string is returned if the file pointer is at the end of the file.
<file object>.readline()
    reads from the file pointer onwards up to and including the first newline ('\n') character, and returns the read characters as a string. An empty string is returned if the file pointer is at the end of the file.
<file object>.readlines()
    reads from the file pointer until the end of file, returning a list of lines, each containing the trailing newline characters, as strings. An empty list is returned if the file pointer is already at the end of the file.
Using the following file as an example
    This is a simple file
    Containing only three lines
    of text
We can demonstrate the use of the various file reading methods.
    >>> f.read(3)
    'Thi'
    >>> f.readline()
    's is a simple file\n'
    >>> f.readlines()
    ['Containing only three lines\n', 'of text\n']
    >>>
Note how using the simple 'read' method, we get only three characters (being the count we specified), and the 'readline' method continues from where 'read' finished. This illustrates the file pointer in action. The file pointer is positioned between the 'i' and 's' of 'This' on the first line of the file after the 'read' method is executed. After the 'readline' it is between the end of line one and the first character of line two ('C'). Thus, 'readlines' has two complete lines left to read when it is called.
Often a more useful way to read the lines of a file in sequence is to loop over the file with a for loop. When used in a for loop, a file object acts as a sequence of lines, as in
    >>> f = open("input.txt","r")
    >>> for line in f:
    ...     print line.rstrip()
    ...
    This is a simple file
    Containing only three lines
    of text
    >>>
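As a self-contained variation on the loop above, the following sketch first creates the three-line example file and then numbers its lines as it reads them. It assumes a current Python 3 interpreter, where print is a function rather than a statement as in the session above, and it places input.txt in a temporary directory so that no existing file is overwritten.

```python
import os
import tempfile

# Recreate the three-line example file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "input.txt")
f = open(path, "w")
f.write("This is a simple file\nContaining only three lines\nof text\n")
f.close()

numbered = []
f = open(path, "r")
for number, line in enumerate(f):
    # each 'line' still ends with its newline, so strip it before use
    numbered.append("%d: %s" % (number + 1, line.rstrip("\n")))
f.close()

for entry in numbered:
    print(entry)
```

Looping over the file reads one line at a time, so unlike readlines it does not need to hold the whole file in memory at once.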
Writing files uses methods very similar to those used to read from files, except that writing is often buffered in memory, meaning the file on disk is only actually updated when the buffer fills up, the buffer is explicitly flushed, or the file object is closed.
<file object>.write(<string>)
    writes the contents of 'string' to the file, starting at the file pointer and overwriting data in the file, or enlarging the file as necessary. Note that no newline is added, so multiple successive write calls without any newlines in their respective strings, produce only one line of text in the file.
<file object>.writelines(<list>)
    writes the elements of the list, which must all be strings, in order to the file. Newlines are not added, so if the strings have no newlines, only one line of output will be written.
As an example, let's create a new file, and write some text out to it.
    >>> w = open("newfile.txt","w")
    >>> w.write("This is the first line of text ")
    >>> w.write("This is still on the first line\n")
    >>> w.writelines(["The second line\n", "The Salmon Mousse"])
    >>> w.close()
    >>>
Looking at 'newfile.txt', we see
    This is the first line of text This is still on the first line
    The second line
    The Salmon Mousse
We see from the first two calls to 'write' that we have to supply our own newline characters to force line breaks in the output. We also see that closing our files is a good idea when they are opened for writing. Technically, Python's garbage collector usually takes care of this for us, closing file objects before they are collected, but it is considered good practice to explicitly close files.
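One common way to make sure a file really is closed, even if something goes wrong halfway through writing, is to put the close call in a finally block. This is a sketch of that pattern rather than the only way to do it; it assumes a current Python 3 interpreter and writes newfile.txt into a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newfile.txt")

w = open(path, "w")
try:
    w.write("This is the first line of text ")
    w.write("This is still on the first line\n")
    w.writelines(["The second line\n", "The Salmon Mousse"])
finally:
    # runs whether or not the writes succeeded; flushes the buffer to disk
    w.close()

r = open(path, "r")
try:
    lines = r.readlines()
finally:
    r.close()

print(lines)
```

Because close flushes any buffered data, reading the file back immediately afterwards is guaranteed to see everything that was written.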
<file object>.close()
    closes the file, disallowing further read or write operation on the file. Any pending written data in the file buffer is flushed to disk. Technically, close is called automatically when a file object variable is garbage collected, but it is good practice to explicitly close the files your program opens, as the operating system has a limit to the number of files that may be open at one time.
Occasionally we may want to move the file pointer manually, for example when reading a file of a format that allows us to skip or ignore large sections to get to the part of the file we want. Python provides two methods of file objects to do this, the first to tell us where the pointer is currently, and the second to move it. Both treat the file as a one dimensional stream of characters, much like a string. The file pointer is an integer index into this 'string' specifying the first character that would be read or written in the next operation.
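<file object>.tell()
    returns the index of the file pointer for the file.
<file object>.seek(<position>[, <whence>])
    moves the file pointer to position relative to a position indicated by whence. Whence can take on one of three values, defaulting to 0, with the following meanings:
    0  position is measured from the start of the file
    1  position is measured from the current file pointer
    2  position is measured from the end of the file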
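The three whence values of seek (0 for the start of the file, 1 for the current position, 2 for the end of the file) can be tried out on a small file of ten digits. This is a sketch under a few assumptions: it targets a current Python 3 interpreter, and it opens the file in binary mode ("wb" and "rb", which are not covered in this text) because Python 3 only allows seeking relative to the current position or the end of the file in binary mode. The file name seek.bin is made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seek.bin")
f = open(path, "wb")
f.write(b"0123456789")       # ten bytes, so index and content coincide
f.close()

f = open(path, "rb")
f.seek(4)                    # whence defaults to 0: 4 bytes from the start
first = f.read(2)            # reads the bytes at index 4 and 5
after_read = f.tell()        # the pointer has moved past what was read

f.seek(-3, 1)                # whence 1: 3 bytes back from the current position
second = f.read(1)

f.seek(-2, 2)                # whence 2: 2 bytes before the end of the file
third = f.read()             # read from there to the end
at_end = f.tell()
f.close()

print(first, after_read, second, third, at_end)
```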
Now that we can move the file pointer manually, we can do some funky things. We could for example seek to somewhere in the middle of a file, and overwrite the data there, without affecting the rest of the file. We might even want to remove information from a file, where we seek to the start of the information, and overwrite it with... well, spaces. Oh dear, we have a problem here! We can't actually remove information from a file, or can we? The truth is that we can only shorten the length of a file, more formally known as truncating the file. What this means is that to delete information contained in the file starting at some position (x) up to some position (y), we must read everything from y to the end of the file, then seek back to x, and write what we read. Finally we truncate the file at the new position of the file pointer.
<file object>.truncate([size])
    truncates the file to be a maximum of size bytes in length. If size is not given, the file is truncated at the current file pointer position.
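Putting seek, read, write and truncate together, the deletion recipe described above (read everything from y onwards, seek back to x, write it, then truncate) might look like the following sketch. delete_range is a hypothetical helper name. The sketch assumes a current Python 3 interpreter and opens the file with the mode "r+b" (read and write in binary, without emptying the file), which is not one of the modes introduced in this text. It also reads the whole tail of the file into memory, which is fine for small files only.

```python
import os
import tempfile

def delete_range(path, x, y):
    """Remove the bytes from position x up to (not including) y.

    Hypothetical helper for illustration only.
    """
    f = open(path, "r+b")    # read and write, keeping the existing contents
    try:
        f.seek(y)
        tail = f.read()      # everything from y to the end of the file
        f.seek(x)
        f.write(tail)        # overwrite, starting at x
        f.truncate()         # cut the file off at the current file pointer
    finally:
        f.close()

path = os.path.join(tempfile.mkdtemp(), "delete.txt")
f = open(path, "wb")
f.write(b"keep REMOVE this")
f.close()

delete_range(path, 5, 12)    # drop the 7 bytes "REMOVE "

f = open(path, "rb")
result = f.read()
f.close()
print(result)
```

Without the final truncate, the file would keep its old length and the last few bytes of the original contents would be left dangling at the end.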