Introductory Programming in Python: Lesson 32
Database Theory

[Prev: Advanced Data Structures: Trees] [Course Outline] [Next: Relational Databases]

The Need for Databases

Technically, the word 'database' refers to any collection of structured data stored on a permanent medium. Early on in the programming days, it was realised that a whole bunch of time and effort was spent dealing with the recurring problem of permanent storage of data, and its efficient retrieval. Every program that required permanent storage of some sort would have to have specialised code written to deal with the low level management of the physical device, e.g. magnetic tape, disk drive, etc... It was soon recognised that an abstracted method of dealing with off-line storage was required and thus the concept of the database was born. The first 'databases' barely qualified to hold the name, and were in fact nothing but flat unstructured file systems managed by operating systems. These days the word 'database' implies a separate piece of software that manages a structured method of storage on top of any underlying file system, but those initial flat file systems were the beginning of the abstraction process. With them in place, programmers no longer needed to deal with writing to and from specific sectors of a tape or disk themselves, they could let the operating system handle it. It was very quickly discovered, however, that the flat file system model was insufficient, and at this point the evolution of databases split into two primary paths. The first path being the so called 'flat' model, which is basically an extension of the flat file system, and the second being the hierarchical model.

The Flat Model

In a flat file system each unit of data could be stored in it's own file, e.g. each customer's details could be stored in it's own file using the customer's account number as the file name. This approach is capable of dealing with a single dataset reasonably well, but provides no way to distinguish between multiple datasets. Consider the customer record example again, but now extend the idea to managing those records over a number of years. If we wanted to break the financials of the business into financial years for tax purposes, we have no way to determine which customer record file belongs in which year, or which transaction file belongs in which year. There are methods to make the distinction, but they are complex and unwieldy at best. Instead the concept of the structured file was developed. Using structured files, one could store multiple records within a single file, and then separate datasets, e.g. financial years, from one another by collecting multiple records belonging to a particular dataset into a single structured file, and each dataset would be in it's own file. This provided two major advantages. Firstly, the logical separation was useful to both programmers and users alike. Secondly, dealing with smaller datasets provides marked speed increases.

Structured files come in a variety of formats. All formats however share some common concepts, although individual implementations may differ. All structured files need some way of specifying when a record ends and a new one begins, and also a method of specifying within a record when a field, or attribute, ends and another begins. Some formats extend the second principle by allowing attributes to be named flexibly, as in the case of the first database applications, a la Dbase.

Structured formats tend to be in human readable form, i.e. they are text files obeying certain formatting conventions, but they may be binary as well. Text based files have a number of advantages:

They are human readable, and human editable, usually by manner of white space delimiters.
They are operating system and platform independent.
They are machine readable, and machine processable, by manner of specific field/record delimiters.

On the other hand, they have a number of disadvantages too:

Writing parsers for text formats can become arbitrarily complex, depending on the format.
Implementing higher level structure (above that of fields and records) is difficult, and drastically increases parser complexity, and processing time.
Difficult to maintain effective cross-referencing and record identification. Consider the effects of using a species name somewhere; one record may record species as 'hippo', another as 'hippopotamus', yet another might use the taxonomic name, and of course human error must always be taken into account: 'hippoppotamus'.

A number of these problems can be dealt with indirectly; standardised parsers can written, formats can be agreed upon in standards committees, and controlled vocabularies or ontologies created to 'enforce' consistent usage of terminologies. But human error cannot be eliminated.

The bioinformatics field, in particular, uses delimited files extensively, with a bewildering number of different formats. EMBL, Genbank, protein databank and a fasta files are but a few examples of such structured formats.

Spreadsheets

While originally developed for accounting and financial purposes, the spreadsheet has been co-opted for alternative use far more than any other generic 'office application'. The reason for this is that they visually represent the tabular structure of records and fields as rows and columns respectively. While each different spreadsheet application uses it's own format, they are generally capable of importing and exporting flat textual files in the form of 'comma separated values' (CSV) files and tab (or other special character) delimited files. In addition the spreadsheet applications themselves provide basic tools for manipulating the dataset, including easy editing of particular fields without worrying about underlying text formats, sorting, filtering, and basic mathematics. We'll learn more about spreadsheets as databases and why they're a bad idea shortly.

XML

In recent years, the eXtensibile Markup Language (XML) format has become a de facto standard for textual database storage. It extends the true flat textual format from simple records and fields to a hierarchical textual format, using 'tags' which can enclose (and hence imply containment, and thence hierarchy) text or other tags. Tags may in addition have arbitrary attributes associated with them, imparting meta-information about the tag itself, or even actual data.

An extensive look at XML and its uses is not worth repeating in these notes, as the topic is well covered elsewhere. An excellent tutorial on the practicalities of coding XML by hand, written by spiderpro is available here.

The Hierarchical Model

The second path of evolution from simple flat file systems was the hierarchical database. It provided a means of partitioning sub-datasets in the form of identifiable records, and expressing in a limited, but explicit, manner the relationships between records by grouping collections of records together, subordinate to parent records. Each group could then be further grouped together with similar groups, and with additional meta data about those groups in higher level groups; e.g. each customer had a file, and these files were grouped together by the customer's home branch (a directory), which were in turn grouped together in regions. The hierarchical model provided a simple way to allow speedy navigation of the database by ordered partitioning. It's easier to find customer Blogs' file amongst the 30 or so files in his home branch, rather than in amongst the thousands of files in the organisation as a whole, as with the flat file system model. To this day, the hierarchical model is helping people store, organise, and lose their data, in the familiar form of your operating system file system structure.

Although the hierarchical model provides much greater flexibility than flat models, it is limited in the nature of the relationships it can model. It is supremely capable of representing hierarchical relationships (hence the name), as in parent/child, container/contained, or category/object relationships. Such relationships are often called 'one-to-many' relationships, there being one record to which many others are subordinate. Hierarchical models cannot represent (without some form of link or pointer extensions) multiple inheritance, or so-called 'many-to-many' relationships. Consider the case of a child; All children have two parents, and may have an arbitrary number of children themselves. In a hierarchical model, the child record cannot be in both the mother's collection of children and the father's collection of children, unless it exists twice as record. If we allow records to have multiple instances, we vastly increase the complexity of the database logic. Which record is the actual record? I must search for and update/modify duplicates when changes are made. If I ask for a summation from many records, do I count duplicates only once, and if so how do I detect them as duplicates? It was these inadequacies that gave rise to the relational database model.

Exercises