Introductory Programming in Python: Lesson 33
Relational Databases

[Prev: Database Theory] [Course Outline] [Next: Structured Query Language]

How Scientists Store Data: or Why Spreadsheets are a Bad Idea

Ahh ... Scientists! The brightest of the bright, smartest of the smart; Capable of amazing intellectual feats, excluding anything to do with a VCR, or apparently sensible data storage. It has long been known (read: The author is to lazy to look up the original reference, or any reference at all for that matter) that most non-computer science students and academics store their experimental/empirical data in the form of an Excel spreadsheet. This is quite possibly the worst way to store data. First let's examine why Excel is such a popular choice:

  1. Excel is a GUI.
  2. People are familiar with Excel.
  3. Excel gives the illusion of structure being imposed on the data stored in the form of column names.
  4. Excel can create pretty pictures.
  5. Excel can perform rudimentary data manipulation, e.g. sorting.
  6. Excel can perform rudimentary analysis on the data, e.g. regression.

Now let's look at the reasons Excel is a bad choice:

  1. Excel is a GUI... its primary purpose is not to store and structure data but to analyse it.
  2. Excel does not in fact impart any structure to the data. Regardless of what I name the column, the cells under those columns can contain any data of any data type.
  3. More importantly, Excel does not enforce any structure on the data.
  4. It's ridiculously easy to make large scale incorrect changes to data... Have you ever sorted by a column and discovered the other columns didn't sort with it?
  5. Excel doesn't keep data together in an identifiable structured manner.

Why Unique Identifiers are Useful

Most scientists make an attempt to organise their data. They label all their physical specimens carefully and explicitly. First mistake! The label 'Blood Sample Joe Blogs 2008-03-14 Stage 3 infection' is probably very helpful to someone looking at the name on a computer screen, but try finding that vial of blood when you have only 10 seconds to read the label while it is out of the liquid nitrogen before it thaws too much to be viable. Sure, the label is explicit, and tells one everything one needs to know about the sample, but one can't physically order a bunch of samples with labels of that form, which means the physical sample cannot be found quickly, nor worked with conveniently in electronic form. Also, it can be very difficult to tell 'Blood Sample Joe Blogs 2008-03-14 Stage 3 infection' apart from 'Blood Sample Joe Blogs 2008-03-18 Stage 3 infection', or 'Blood Sample Joe Blags 2008-03-14 Stage 3 infection'.

Thus we introduce the case for the unique identifier. A unique identifier need bear no semantic relation to the thing it identifies. It is just a number, usually, and can always be mapped to a number. This gives one two distinct advantages:

To such a unique identifier, we can attach additional attributes or information, which brings us back to the concept of records (now indexed by a unique identifier) and fields.

The Relational Database Model: Tables and Columns

The relational database model deals with abstract data management and storage. Most of the abstraction involves changing terminology so it doesn't sound like every other database model. A relational database consists of one or more tables (more formally known as relations). Each table has rows (records, or formally tuples), and each row in a given table has identical columns (fields, or formally attributes) to all other rows in the same table. So we have something that looks like a spreadsheet, but with some important differences. In a spreadsheet one row may use more or less columns than another, whereas a relational database table does not allow this. Also, all columns in a table have a type, e.g. text, integer, float, etc... This is a property of the column as a whole, not a field in an individual row, meaning all data values stored in a particular column in a particular table are of the same type.

Consider the case of a medical trial. The trials need to keep records of the patients involved, the medications administered, the date of administrations, contact details, and their states of health at various time points. For the moment we'll just look at the patients themselves.

The 'Patient' Table
patient_id surname firstname initials phone address
integer char(64) char(64) char(6) char(12) text
1 Smith Alice C 0123456789 1 The Dung Heap, Putney, 4231
2 Black Barry J 0239876543 42 Enigma Lane, Illford, 5017
3 Johnson Charles NULL 0192837465 32 Bit Boulevard, Heartfordshire, 4870

Examining the table above, we can observe a number of things. Firstly, there is a unique identifier, namely the 'patient_id'. The patient themselves may never know this number, but it allows the database system to distinguish between two different Alice C. Smith's, if the case were to arise.

Primary Keys

Each table in a relational database can have a primary key. The primary key is always unique amongst the rows of the table, and is thus a unique identifier. It is usually a number, but can be any sortable type. It can consists of multiple columns too, but we'll get to that later. Depending on the database software being used, primary keys that are of numeric type can be automatically generated as new rows are created, leaving the hassle of determining a new unique primary key for a row up to the database software and not the programmer. Methods for doing this are implementation specific. In our example above we would choose 'patient_id' as our primary key for the table.

Relationships Between Tables and Foreign Keys

Considering the information we want to store, the patient table is woefully inadequate. What about their state of health, or their medication records? Well, we could add columns to the patient table, for example a column named 'medication_date', and another, 'dosage'. But this would be the wrong approach. We would either limit ourselves to exactly one administration of medication, or alternatively duplicate data, and end up with diverging information about the 'same' record.

Let's consider the first option. If we wanted to store multiple administration events, we would have to add columns to some rows, as those events happened, leaving other rows alone. But this violates the principle that all rows in a table have exactly the same columns. No dice! If we extended the entire table, we would have a number of rows with missing data — less of a problem — but also not desirable.

The second option is to make each row its own administration event. So every time Barry Black gets dosed, he gets a row in the table, with the additional columns 'dosage' and 'medication_date'. Two rows means he's been dosed twice, three rows, thrice, etc... The problem here is that each time we create a new row for Barry, there's a chance for human error. Barry Black might have been dosed three times, but the system reflects only twice, while the mysterious Barry Block has also been dosed (according to the system), but no one knows who he is.

The solution, of course, is to create another table. Let's call it 'Dose'. It records the date of a dose, an arbitrary unique id for the dose (for example a vial bar code), the dosage amount, and of course to whom the dose was given.

The 'Dose' Table
dose_id date amount patient
integer date float integer (Patient)
1 2008/08/02 3.2 2
2 2008/08/02 3.2 1
3 2008/08/02 3.2 3
4 2008/08/12 2.5 2

So we see from this table that everyone arrived on 2008/08/02 for their initial dose, but that so far only Barry Black has arrived for his second dose, 10 days later, and that the second dose is less than the first (presumably to suit the purposes of the trial).

The biggest concept to grasp here is that of the foreign key. Note the fourth column, 'patient', has type integer, and is 'tagged' as pointing to the Patient table. In relational database terms such a column is called a foreign key, and this implies a number of things:

Foreign keys and the separation of data into various tables provide us with a number of pros over the other two possible methods suggested before. Firstly, when we create a new row in Dose, we simply 'link' it to the appropriate patient, meaning there's far less chance of human error. In fact the only human error that can occur is that a dose may be linked to the wrong patient, or perhaps have the amount or date recorded incorrectly. Secondly, the number of columns in both tables remains the same, both for every row in each table respectively, and for each table over time. We don't have to change the structure of our data.

Using foreign keys and an appropriate structure for our data, we can answer a number of questions simply and quickly, for example:

  1. How many doses has patient X received? (count number of rows in 'Dose' where patient = X)
  2. If, for example, every even numbered dose is a placebo, then which patients received placebos? (list rows from Patient where patient_id = Dose.patient and dose_id is even)
  3. How may doses were given out between specified dates?
  4. On what dates did patient X receive their doses?

Joining Tables to Query Data

If we look at question 2 immediately above, we see something a little strange, namely the comparison of a value in a single row (patient_id) to all the values in a particular column of an entire table (Dose.patient). Well, that's what it looks like because of the way we phrased the question, but in actual fact particular rows in Dose are chosen for comparison implicitly. Technically, what happens is a joining of the two tables on certain conditions, in this case the condition that Patient.patient_id = Dose.patient.

A join involves the implicit creation of a temporary table, with columns from both source tables, and rows filled out with data from the rows of the source tables that match specified conditions. This is best explained by example.

Rephrasing question 2, let's display all rows from Dose where the dose_id is even (i.e. they're placebo doses) and append the information from the Patient table of the patient whose patient_id matches the value in the patient column of each found row from dose. Now, reading from left to right, we first get rows 2 and 4 from Dose. Let's put these rows into a new table called 'Result'.

The 'Result' Table
dose_id date amount patient
integer date float integer (Patient)
2 2008/08/02 3.2 1
4 2008/08/12 2.5 2

Continuing, we want to append information from the Patient Table. For the moment we are only interested in names. So let's add the column's to 'Result'.

The new 'Result' Table
dose_id date amount patient firstname surname
integer date float integer (Patient) char(64) char(64)
2 2008/08/02 3.2 1 NULL NULL
4 2008/08/12 2.5 2 NULL NULL

Now that we have added the columns, we need to fill them with the correct data, replacing the NULLs. Relational databases all have a special value 'NULL' which is similar to None in python. It is typeless, and as such any column can have NULL values regardless of its type. You can specify the NULL values are not allowed in certain columns, and this is implicit in some cases, e.g. primary keys cannot be NULL. Technically, the NULL values are never created during an actual join, and are visible here only for illustrative purposes.

We need to fill in the NULLs with the information from rows from the Patient table whose patient_id matches the value of the column 'patient' in the row we are trying to fill in. So, for the first row's NULLs we take the first name and surname from row 1 of Patient, i.e. Alice Smith. Similarly, for the second row of the Result table, we fill in 'Barry' and 'Black'.

The final 'Result' Table
dose_id date amount patient firstname surname
integer date float integer (Patient) char(64) char(64)
2 2008/08/02 3.2 1 Alice Smith
4 2008/08/12 2.5 2 Barry Black

You might be surprised at some point by what looks like the creation of extra rows out of nowhere when performing a join, or rather, you might get more rows than you expect. For example, what if one wanted to see the information about Barry Black, and his dosage history. We might expect one row, since we're only interested in one patient. But Barry has been dosed twice. The complete joined result table would look like this:

Barry Black's Resultant Join
patient_id surname firstname initials phone address dose_id date amount patient
integer char(64) char(64) char(6) char(12) text integer date float integer (Patient)
2 Black Barry J 0239876543 42 Enigma Lane, Illford, 5017 1 2008/08/02 3.2 2
2 Black Barry J 0239876543 42 Enigma Lane, Illford, 5017 4 2008/08/12 2.5 2

Clearly, there are two rows for one patient. The alternative would be to have one row with a variable number of columns, but this presents problems. Firstly, we might have multiple columns with the same names, and no way to tell which columns should be grouped together, e.g. which dose id belongs to which date. Also, it is programatically simpler to deal with multiple rows with a consistent number of columns, than the opposite. In addition, the above example highlights some important points about joining tables:

Cardinality of Relationships

When designing our data structure and choosing which tables we will use, what they will contain, and how they will be linked, it is important to understand how the information in various tables is related. While this is often obvious in a particular case, more abstractly there are only three possible types of relationships that can exist between any two given tables, and relational databases can only really represent two of these.

One-to-one:
One row in Table A is linked to at most one row in Table B. This is represented using a foreign key in Table B that points to the primary key of Table A, and has a uniqueness constraint placed on the foreign key in Table B. The foreign key may allow NULL values, or may not. This type of relationship is most often used to represent some form of class inheritance of attributes, e.g. Given three tables to record information about staff, namely Staff, Doctors, Nurses. We want to store the same basic information about both nurses and doctors, but for each type there are profession specific details as well. Both the doctors and nurses table link to staff in a one-to-one manner, as a staff member cannot exist twice, but the doctors and nurses tables respectively store the profession specific information required.
One-to-Many
One row in Table A is linked to potentially multiple rows in Table B. This is represented using a foreign key in Table B that points to the primary key of Table A, with no uniqueness constraint. Again NULL values for the foreign key may or may not be allowed on a case by case basis. Our previous example of dosage and patients is a one-to-many relationship.
Many-to-Many
Potentially multiple rows in Table A are linked to potentially multiple rows in Table B. Continuing with the medical theme of examples, a doctor treats many patients, represented by a foreign key in the Patient table to the primary key of the Doctor Table. However, a single patient may be treated by many doctors over the long term.

The many-to-many case cannot be directly represented in a relational databases. If we put the foreign key in the Doctor table, that means patients doctors can only treat one patient, without using a dynamic number of columns in a table; an idea which we've hopefully left behind to rot on the wayside. Similarly, putting the foreign key in the Patient table means a patient can be treated by only one doctor.

The solution is to create a third intermediate table, often called a link table. In this table we have two foreign keys. One to doctor, and one to patient. The presence of a row in the link table pointing to some doctor D and some patient P, means that patient P has been treated by doctor D. Should D treat another patient Q, we simply add a row to the link table (D:Q). Similarly, should patient P be treated by a second doctor (E), we add a row (E:P) to the link table. Because both the doctor and patient columns in the link table are foreign keys, the same doctor can appear multiple times, and the same patient can appear multiple times.

In fact, we can store additional information in link tables. We might, for instance, want to store the dates of treatments. In this case the link table would have the two foreign key columns and a date column. But now what if a patient is treated by the same doctor multiple times. No problem. Create another row with the same doctor/patient pairing, and a different date. It is useful in this case to create an additional column to act as primary key for the link table.

There are also times when we might want to indicate only that two item are linked in some way. An artist may record multiple songs, and similarly a song may be performed by multiple artists in a single recording. We are not interested in how many times a particular artist performed a song though. So we would have the three tables Artist, Song, and PerformedBy. PerformedBy would have only two columns, artist (a foreign key) and song (another foreign key). We want to ensure that we don't record more than once the fact that some artist A has performed the song S. We could make either of the two foreign keys a primary key as well, but that causes problems. If we make artist the primary key, the we can only record one song performance for any given artist. Instead, we can make the primary key consist of both columns in the table. This means the we can have many rows with the same artist, as long as each row has a different song as the other foreign key. Likewise, each sing can appear multiple times as long as each artist paired with it is unique. This combination means we never record the fact that an artist has performed a song more than once.

Normalisation

As always, real life intrudes rudely on the ideals on computational theory. The fact of the matter is that databases change. Not only their contents change, we expect that, and that's what they're designed to deal with anyway. But over time, we might require that their structures be modified to. Due to some change in the law we might suddenly need to keep additional business records about certain types of transactions, for example. Since the inception of relational databases though, we have also had the concept of normalisation, which is a method to reduce to work involved in making structural changes to database.

If we explore the example of a law being changed or introduced requiring additional record keeping, we can quickly tell how difficult it becomes to modify database structure once data already exists. We have a very simple database originally, with only one table, which records the sale of products: product name, price, and time of sale. Sometimes, the additional information we need to record can be added simply. Tax amount? We can simply add a column that records the taxable amount of the sale. But what if the problem is more complex. Suppose we run a hardware store selling some products that are regulated by the government, e.g. detonators for industrial explosives. Currently we are required to keep a record of to whom the sale of such an item was made, and we record it in a column in the sales table. Now, the law changes, and retailers are required not only to record to whom the sale is made, but also to limit the number of detonators a single customer may buy to one per day. Now we're stuck. We can't make the purchaser field unique, because that would mean a particular customer could only but a detonator once, and never again. Also, what about the other products they might want to buy? What we need to do is now separate the purchasers out into their own table, and instead use a link table where we can specify a primary key over the columns: purchaser, sale, and date. Rows are only added to the link table for detonator purchases. Now the same person trying to purchase multiple detonators on the same date will be disallowed by the database system. Perfect solution, but it requires some fairly extensive restructuring of the actual database structure, the creation of new tables, the removal of columns from one table and their inclusion into another, and all this without losing any data, either directly, or implicit, like the fact that the purchaser being moved came originally from the sale with id X on date Y.

Now that we understand the nastiness of getting things wrong, how do we get them right. Enter, the normal forms. The normal forms are each a collection of criteria for your database structure, that if fulfilled will generally ensure that structural changes are restricted to additions instead of deletions, or worse rearrangements and movements. Followed stepwise and in order, and applied to every table, they are as follows:

The First Normal Form

  1. Each row is unique, i.e. no two rows in a table are entirely identical. Easily done by making sure every table has a primary key.
  2. All columns in a table consist of value of consistent type/kind. Normally taken care of database software enforcing typing on columns.
  3. All columns are single-valued, i.e. no column has tuple like or similarly compound data.
  4. The order of the rows is insignificant.
  5. The order of the columns is insignificant.
Before First Normal form
Accession Species Seq Accession Species Seq
AF111167 Homo sapiens ACTGGTAG AY82998 Rattus norvegicus TGGGTGGCA
After First Normal form
Accession Species Seq
AF111167 Homo sapiens ACTGGTAG
AY82998 Rattus norvegicus TGGGTGGCA

The Second Normal Form

From the first normal form, separate out into different tables any columns in any tables that are not dependant on the whole primary key of their table.

Before Second Normal Form
Experiment Accession Species Seq
A AF111167 Homo sapiens ACTGGTAG
A AY828998 Rattus norvegicus TGGGTGGCA
B AY828999 Rattus norvegicus TGGCCAT

Sequences: After Second Normal Form
Experiment Accession Seq
A AF111167 ACTGGTAG
A AY828998 TGGGTGGCA
B AY828999 TGGCCAT
Species: After Second Normal Form
Accession Species
AF111167 Homo sapiens
AY828998 Rattus norvegicus
AY828999 Rattus norvegicus

In this example the problem is that species is dependant only on the accession column, not on the entire primary key. This situation rarely comes up.

The Third Normal Form

From the second normal form, separate out into their own respective tables, groups of columns that collectively depend on each other but not on the primary key or alternative candidate keys columns.

Before Third Normal Form
Accession Species Seq TaxId
AF111167 Homo sapiens ACTGGTAG 9606
AY828998 Rattus norvegicus TGGGTGGCA 10166
AY828999 Rattus norvegicus TGGCCAT 10166

Sequence: After Third Normal Form
Accession TaxId Seq
AF111167 9606 ACTGGTAG
AY828998 10166 TGGGTGGCA
AY828999 10166 TGGCCAT
Species: After Third Normal Form
TaxId Species
9606 Homo sapiens
10166 Rattus norvegicus

Further Reading

For a way more in depth technical discussion on the normal forms (there are more than 3, but only the first three are generally important, no matter what your local database systems lecturer insists), take a look at the the wikipedia entry on database normalisation.

The Client/Server Model

Most relational database software is implemented using a client/server model, where the database software runs as a server, and accepts connections on which the client can send instructions to either examine or modify the contents of the database. This approach has a number of advantages:

What this means to us as budding python programmers is that in order to use a relational database, which is presently the industry standard model for databases, we need to have a relational database server installed where we can reasonably access one. For the remainder of this course we're going to use PostgreSQL as our example software, just because the author of these notes happens to like it better. Where possible, i.e. where the author can actually remember them, MySQL syntax differences will be noted. MySQL is the most widely adopted open source relational database server software, but PostgreSQL is the better for educational purposes and general standards compliance, plus it actually implements cool stuff like triggers, and implicit constraints, and rules, ooh and ...

It is beyond the scope of this material... blah blah blah. We're not going to cover installation of a database server because it tends to be an arcane black art that is highly distribution and OS dependant. Get you're local IT guru to help you out, have pizza and beer on hand for payment.

Exercises

[Prev: Database Theory] [Course Outline] [Next: Structured Query Language]