Ahh ... Scientists! The brightest of the bright, smartest of the smart; Capable of amazing intellectual feats, excluding anything to do with a VCR, or apparently sensible data storage. It has long been known (read: The author is to lazy to look up the original reference, or any reference at all for that matter) that most non-computer science students and academics store their experimental/empirical data in the form of an Excel spreadsheet. This is quite possibly the worst way to store data. First let's examine why Excel is such a popular choice:
Now let's look at the reasons Excel is a bad choice:
Most scientists make an attempt to organise their data. They label all their physical specimens carefully and explicitly. First mistake! The label 'Blood Sample Joe Blogs 2008-03-14 Stage 3 infection' is probably very helpful to someone looking at the name on a computer screen, but try finding that vial of blood when you have only 10 seconds to read the label while it is out of the liquid nitrogen before it thaws too much to be viable. Sure, the label is explicit, and tells one everything one needs to know about the sample, but one can't physically order a bunch of samples with labels of that form, which means the physical sample cannot be found quickly, nor worked with conveniently in electronic form. Also, it can be very difficult to tell 'Blood Sample Joe Blogs 2008-03-14 Stage 3 infection' apart from 'Blood Sample Joe Blogs 2008-03-18 Stage 3 infection', or 'Blood Sample Joe Blags 2008-03-14 Stage 3 infection'.
Thus we introduce the case for the unique identifier. A unique identifier need bear no semantic relation to the thing it identifies. It is just a number, usually, and can always be mapped to a number. This gives one two distinct advantages:
To such a unique identifier, we can attach additional attributes or information, which brings us back to the concept of records (now indexed by a unique identifier) and fields.
The relational database model deals with abstract data management and storage. Most of the abstraction involves changing terminology so it doesn't sound like every other database model. A relational database consists of one or more tables (more formally known as relations). Each table has rows (records, or formally tuples), and each row in a given table has identical columns (fields, or formally attributes) to all other rows in the same table. So we have something that looks like a spreadsheet, but with some important differences. In a spreadsheet one row may use more or less columns than another, whereas a relational database table does not allow this. Also, all columns in a table have a type, e.g. text, integer, float, etc... This is a property of the column as a whole, not a field in an individual row, meaning all data values stored in a particular column in a particular table are of the same type.
Consider the case of a medical trial. The trials need to keep records of the patients involved, the medications administered, the date of administrations, contact details, and their states of health at various time points. For the moment we'll just look at the patients themselves.
patient_id | surname | firstname | initials | phone | address |
---|---|---|---|---|---|
integer | char(64) | char(64) | char(6) | char(12) | text |
1 | Smith | Alice | C | 0123456789 | 1 The Dung Heap, Putney, 4231 |
2 | Black | Barry | J | 0239876543 | 42 Enigma Lane, Illford, 5017 |
3 | Johnson | Charles | NULL | 0192837465 | 32 Bit Boulevard, Heartfordshire, 4870 |
Examining the table above, we can observe a number of things. Firstly, there is a unique identifier, namely the 'patient_id'. The patient themselves may never know this number, but it allows the database system to distinguish between two different Alice C. Smith's, if the case were to arise.
Each table in a relational database can have a primary key. The primary key is always unique amongst the rows of the table, and is thus a unique identifier. It is usually a number, but can be any sortable type. It can consists of multiple columns too, but we'll get to that later. Depending on the database software being used, primary keys that are of numeric type can be automatically generated as new rows are created, leaving the hassle of determining a new unique primary key for a row up to the database software and not the programmer. Methods for doing this are implementation specific. In our example above we would choose 'patient_id' as our primary key for the table.
Considering the information we want to store, the patient table is woefully inadequate. What about their state of health, or their medication records? Well, we could add columns to the patient table, for example a column named 'medication_date', and another, 'dosage'. But this would be the wrong approach. We would either limit ourselves to exactly one administration of medication, or alternatively duplicate data, and end up with diverging information about the 'same' record.
Let's consider the first option. If we wanted to store multiple administration events, we would have to add columns to some rows, as those events happened, leaving other rows alone. But this violates the principle that all rows in a table have exactly the same columns. No dice! If we extended the entire table, we would have a number of rows with missing data — less of a problem — but also not desirable.
The second option is to make each row its own administration event. So every time Barry Black gets dosed, he gets a row in the table, with the additional columns 'dosage' and 'medication_date'. Two rows means he's been dosed twice, three rows, thrice, etc... The problem here is that each time we create a new row for Barry, there's a chance for human error. Barry Black might have been dosed three times, but the system reflects only twice, while the mysterious Barry Block has also been dosed (according to the system), but no one knows who he is.
The solution, of course, is to create another table. Let's call it 'Dose'. It records the date of a dose, an arbitrary unique id for the dose (for example a vial bar code), the dosage amount, and of course to whom the dose was given.
dose_id | date | amount | patient |
---|---|---|---|
integer | date | float | integer (Patient) |
1 | 2008/08/02 | 3.2 | 2 |
2 | 2008/08/02 | 3.2 | 1 |
3 | 2008/08/02 | 3.2 | 3 |
4 | 2008/08/12 | 2.5 | 2 |
So we see from this table that everyone arrived on 2008/08/02 for their initial dose, but that so far only Barry Black has arrived for his second dose, 10 days later, and that the second dose is less than the first (presumably to suit the purposes of the trial).
The biggest concept to grasp here is that of the foreign key. Note the fourth column, 'patient', has type integer, and is 'tagged' as pointing to the Patient table. In relational database terms such a column is called a foreign key, and this implies a number of things:
Foreign keys and the separation of data into various tables provide us with a number of pros over the other two possible methods suggested before. Firstly, when we create a new row in Dose, we simply 'link' it to the appropriate patient, meaning there's far less chance of human error. In fact the only human error that can occur is that a dose may be linked to the wrong patient, or perhaps have the amount or date recorded incorrectly. Secondly, the number of columns in both tables remains the same, both for every row in each table respectively, and for each table over time. We don't have to change the structure of our data.
Using foreign keys and an appropriate structure for our data, we can answer a number of questions simply and quickly, for example:
If we look at question 2 immediately above, we see something a little strange, namely the comparison of a value in a single row (patient_id) to all the values in a particular column of an entire table (Dose.patient). Well, that's what it looks like because of the way we phrased the question, but in actual fact particular rows in Dose are chosen for comparison implicitly. Technically, what happens is a joining of the two tables on certain conditions, in this case the condition that Patient.patient_id = Dose.patient.
A join involves the implicit creation of a temporary table, with columns from both source tables, and rows filled out with data from the rows of the source tables that match specified conditions. This is best explained by example.
Rephrasing question 2, let's display all rows from Dose where the dose_id is even (i.e. they're placebo doses) and append the information from the Patient table of the patient whose patient_id matches the value in the patient column of each found row from dose. Now, reading from left to right, we first get rows 2 and 4 from Dose. Let's put these rows into a new table called 'Result'.
dose_id | date | amount | patient |
---|---|---|---|
integer | date | float | integer (Patient) |
2 | 2008/08/02 | 3.2 | 1 |
4 | 2008/08/12 | 2.5 | 2 |
Continuing, we want to append information from the Patient Table. For the moment we are only interested in names. So let's add the column's to 'Result'.
dose_id | date | amount | patient | firstname | surname |
---|---|---|---|---|---|
integer | date | float | integer (Patient) | char(64) | char(64) |
2 | 2008/08/02 | 3.2 | 1 | NULL | NULL |
4 | 2008/08/12 | 2.5 | 2 | NULL | NULL |
Now that we have added the columns, we need to fill them with the correct data, replacing the NULLs. Relational databases all have a special value 'NULL' which is similar to None in python. It is typeless, and as such any column can have NULL values regardless of its type. You can specify the NULL values are not allowed in certain columns, and this is implicit in some cases, e.g. primary keys cannot be NULL. Technically, the NULL values are never created during an actual join, and are visible here only for illustrative purposes.
We need to fill in the NULLs with the information from rows from the Patient table whose patient_id matches the value of the column 'patient' in the row we are trying to fill in. So, for the first row's NULLs we take the first name and surname from row 1 of Patient, i.e. Alice Smith. Similarly, for the second row of the Result table, we fill in 'Barry' and 'Black'.
dose_id | date | amount | patient | firstname | surname |
---|---|---|---|---|---|
integer | date | float | integer (Patient) | char(64) | char(64) |
2 | 2008/08/02 | 3.2 | 1 | Alice | Smith |
4 | 2008/08/12 | 2.5 | 2 | Barry | Black |
You might be surprised at some point by what looks like the creation of extra rows out of nowhere when performing a join, or rather, you might get more rows than you expect. For example, what if one wanted to see the information about Barry Black, and his dosage history. We might expect one row, since we're only interested in one patient. But Barry has been dosed twice. The complete joined result table would look like this:
patient_id | surname | firstname | initials | phone | address | dose_id | date | amount | patient |
---|---|---|---|---|---|---|---|---|---|
integer | char(64) | char(64) | char(6) | char(12) | text | integer | date | float | integer (Patient) |
2 | Black | Barry | J | 0239876543 | 42 Enigma Lane, Illford, 5017 | 1 | 2008/08/02 | 3.2 | 2 |
2 | Black | Barry | J | 0239876543 | 42 Enigma Lane, Illford, 5017 | 4 | 2008/08/12 | 2.5 | 2 |
Clearly, there are two rows for one patient. The alternative would be to have one row with a variable number of columns, but this presents problems. Firstly, we might have multiple columns with the same names, and no way to tell which columns should be grouped together, e.g. which dose id belongs to which date. Also, it is programatically simpler to deal with multiple rows with a consistent number of columns, than the opposite. In addition, the above example highlights some important points about joining tables:
When designing our data structure and choosing which tables we will use, what they will contain, and how they will be linked, it is important to understand how the information in various tables is related. While this is often obvious in a particular case, more abstractly there are only three possible types of relationships that can exist between any two given tables, and relational databases can only really represent two of these.
The many-to-many case cannot be directly represented in a relational databases. If we put the foreign key in the Doctor table, that means patients doctors can only treat one patient, without using a dynamic number of columns in a table; an idea which we've hopefully left behind to rot on the wayside. Similarly, putting the foreign key in the Patient table means a patient can be treated by only one doctor.
The solution is to create a third intermediate table, often called a link table. In this table we have two foreign keys. One to doctor, and one to patient. The presence of a row in the link table pointing to some doctor D and some patient P, means that patient P has been treated by doctor D. Should D treat another patient Q, we simply add a row to the link table (D:Q). Similarly, should patient P be treated by a second doctor (E), we add a row (E:P) to the link table. Because both the doctor and patient columns in the link table are foreign keys, the same doctor can appear multiple times, and the same patient can appear multiple times.
In fact, we can store additional information in link tables. We might, for instance, want to store the dates of treatments. In this case the link table would have the two foreign key columns and a date column. But now what if a patient is treated by the same doctor multiple times. No problem. Create another row with the same doctor/patient pairing, and a different date. It is useful in this case to create an additional column to act as primary key for the link table.
There are also times when we might want to indicate only that two item are linked in some way. An artist may record multiple songs, and similarly a song may be performed by multiple artists in a single recording. We are not interested in how many times a particular artist performed a song though. So we would have the three tables Artist, Song, and PerformedBy. PerformedBy would have only two columns, artist (a foreign key) and song (another foreign key). We want to ensure that we don't record more than once the fact that some artist A has performed the song S. We could make either of the two foreign keys a primary key as well, but that causes problems. If we make artist the primary key, the we can only record one song performance for any given artist. Instead, we can make the primary key consist of both columns in the table. This means the we can have many rows with the same artist, as long as each row has a different song as the other foreign key. Likewise, each sing can appear multiple times as long as each artist paired with it is unique. This combination means we never record the fact that an artist has performed a song more than once.
As always, real life intrudes rudely on the ideals on computational theory. The fact of the matter is that databases change. Not only their contents change, we expect that, and that's what they're designed to deal with anyway. But over time, we might require that their structures be modified to. Due to some change in the law we might suddenly need to keep additional business records about certain types of transactions, for example. Since the inception of relational databases though, we have also had the concept of normalisation, which is a method to reduce to work involved in making structural changes to database.
If we explore the example of a law being changed or introduced requiring additional record keeping, we can quickly tell how difficult it becomes to modify database structure once data already exists. We have a very simple database originally, with only one table, which records the sale of products: product name, price, and time of sale. Sometimes, the additional information we need to record can be added simply. Tax amount? We can simply add a column that records the taxable amount of the sale. But what if the problem is more complex. Suppose we run a hardware store selling some products that are regulated by the government, e.g. detonators for industrial explosives. Currently we are required to keep a record of to whom the sale of such an item was made, and we record it in a column in the sales table. Now, the law changes, and retailers are required not only to record to whom the sale is made, but also to limit the number of detonators a single customer may buy to one per day. Now we're stuck. We can't make the purchaser field unique, because that would mean a particular customer could only but a detonator once, and never again. Also, what about the other products they might want to buy? What we need to do is now separate the purchasers out into their own table, and instead use a link table where we can specify a primary key over the columns: purchaser, sale, and date. Rows are only added to the link table for detonator purchases. Now the same person trying to purchase multiple detonators on the same date will be disallowed by the database system. Perfect solution, but it requires some fairly extensive restructuring of the actual database structure, the creation of new tables, the removal of columns from one table and their inclusion into another, and all this without losing any data, either directly, or implicit, like the fact that the purchaser being moved came originally from the sale with id X on date Y.
Now that we understand the nastiness of getting things wrong, how do we get them right. Enter, the normal forms. The normal forms are each a collection of criteria for your database structure, that if fulfilled will generally ensure that structural changes are restricted to additions instead of deletions, or worse rearrangements and movements. Followed stepwise and in order, and applied to every table, they are as follows:
Accession | Species | Seq | Accession | Species | Seq |
---|---|---|---|---|---|
AF111167 | Homo sapiens | ACTGGTAG | AY82998 | Rattus norvegicus | TGGGTGGCA |
Accession | Species | Seq |
---|---|---|
AF111167 | Homo sapiens | ACTGGTAG |
AY82998 | Rattus norvegicus | TGGGTGGCA |
From the first normal form, separate out into different tables any columns in any tables that are not dependant on the whole primary key of their table.
Experiment | Accession | Species | Seq |
---|---|---|---|
A | AF111167 | Homo sapiens | ACTGGTAG |
A | AY828998 | Rattus norvegicus | TGGGTGGCA |
B | AY828999 | Rattus norvegicus | TGGCCAT |
Experiment | Accession | Seq |
---|---|---|
A | AF111167 | ACTGGTAG |
A | AY828998 | TGGGTGGCA |
B | AY828999 | TGGCCAT |
Accession | Species |
---|---|
AF111167 | Homo sapiens |
AY828998 | Rattus norvegicus |
AY828999 | Rattus norvegicus |
In this example the problem is that species is dependant only on the accession column, not on the entire primary key. This situation rarely comes up.
From the second normal form, separate out into their own respective tables, groups of columns that collectively depend on each other but not on the primary key or alternative candidate keys columns.
Accession | Species | Seq | TaxId |
---|---|---|---|
AF111167 | Homo sapiens | ACTGGTAG | 9606 |
AY828998 | Rattus norvegicus | TGGGTGGCA | 10166 |
AY828999 | Rattus norvegicus | TGGCCAT | 10166 |
Accession | TaxId | Seq |
---|---|---|
AF111167 | 9606 | ACTGGTAG |
AY828998 | 10166 | TGGGTGGCA |
AY828999 | 10166 | TGGCCAT |
TaxId | Species |
---|---|
9606 | Homo sapiens |
10166 | Rattus norvegicus |
For a way more in depth technical discussion on the normal forms (there are more than 3, but only the first three are generally important, no matter what your local database systems lecturer insists), take a look at the the wikipedia entry on database normalisation.
Most relational database software is implemented using a client/server model, where the database software runs as a server, and accepts connections on which the client can send instructions to either examine or modify the contents of the database. This approach has a number of advantages:
What this means to us as budding python programmers is that in order to use a relational database, which is presently the industry standard model for databases, we need to have a relational database server installed where we can reasonably access one. For the remainder of this course we're going to use PostgreSQL as our example software, just because the author of these notes happens to like it better. Where possible, i.e. where the author can actually remember them, MySQL syntax differences will be noted. MySQL is the most widely adopted open source relational database server software, but PostgreSQL is the better for educational purposes and general standards compliance, plus it actually implements cool stuff like triggers, and implicit constraints, and rules, ooh and ...
It is beyond the scope of this material... blah blah blah. We're not going to cover installation of a database server because it tends to be an arcane black art that is highly distribution and OS dependant. Get you're local IT guru to help you out, have pizza and beer on hand for payment.