Introductory Programming in Python: Lesson 34
Structured Query Language

[Prev: Relational Databases] [Course Outline] [Next: Interfacing with Databases using Python]

This lesson deals with standard structured query language (SQL). SQL is a not a programming language as such, but rather, as its name suggests, a query language. It is a language designed specifically to let one phrase a complex question in a human readable but also machine parsable manner, that can extract arbitrary collection of data from a given database.

The details of database administration are not part of the SQL standard, and as such not covered in this lesson (or this course for that matter). They are different for every implementation of SQL, and covered by each implementation's documentation. For the moment, it is assumed that you have an account on a database already set up. This lesson assumes a PostgreSQL database.

Starting the database client

The first thing to do is start up your database client. In a terminal, or command line window if using windows (Why are you still using Windows?), type psql for PostgreSQL:

sirlark@hephaestus ~ $ psql
Welcome to psql 8.0.15, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

sirlark=>

You are now connected to the PostgreSQL server, and are using the database named <your username> (in this case sirlark).

For MySQL type mysql -D <your username> -p:

sirlark@hephaestus ~ $ mysql -D sirlark -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.0.60-log Gentoo Linux mysql-5.0.60-r1

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql>

Creating a new database

While the creation of a new database is nominally an administrative issue, it is covered by the SQL standard. Quite simply we use the command CREATE DATABASE <database name>;. SQL commands are not case sensitive, though table names and column names may be depending on implementation, however it is common practice to use uppercase for SQL specific keywords to make your queries more legible. Note the trailing semicolon. SQL is insensitive to the amount and nature of whitespace used, i.e. all whitespace is considered equivalent. As such a newline does not end a command. Instead all commands end is a semicolon.

Creating Tables

The first thing we want to do with any relational database is to specify its structure. We do this by creating tables, which initially contain no rows. The postgres client allows us to list all the tables present in the database using the command \dt (The MySQL equivalent is SHOW TABLES).

Creating a new table is as simple as:

CREATE TABLE <table name> (
<column name> <type> [<constraints>] [DEFAULT <default value>] [,
...
]
);

Columns Types

Like python, relational databases have type, except they enforce the type of data present in a column of a table. Different implementations provide various different types, but the SQL standard specifies a shared collection of types available in all implementations.

[UNSIGNED] INTEGER: Nuf' said.
FLOAT: Ditto
DOUBLE: A bigger float
NUMERIC(<precision>[, <scale>]): Numeric data with guaranteed precision. Precision is the total number of digits (maximum 1000), and scale is the number of digits to the right of the decimal point (default 0). NUMERIC data is slow to work with, compared to INTEGER and FLOAT, but provides guaranteed accuracy.
VARCHAR(X): String data up to X characters in length. X must be less than 256.
CHAR(X): String data up to X characters in length. Regardless of how many characters are present in the string X characters are stored on disk. Faster than VARCHAR, but uses more space. X must be less than 256.
TEXT: String data of essentially indefinite length. Maximum length varies with implementation.
DATE: A date
TIME: A time
YEAR: A four digit year
TIMESTAMP [{WITH|WITHOUT} TIME ZONE]: A date and a time, optionally with automatic timezone conversion, which is the default.
INTERVAL: A length of time, e.g. 3 days and 4 hours
BOOLEAN: Either TRUE or FALSE

There are many more standard SQL types, which you can find in the SQL standard, or in your database software documentation. The above types however will cover most of your bases.

Column Constraints

Although every column we specify must have a type, constraints are optional. Constraints are rules we specify that constrain not the type of data that may be stored in a column, but the content of that data. Constraints can be specified in any order. Some common constraint keywords are:

PRIMARY KEY: Indicates that this column should be considered the primary key of the table.
UNIQUE: Indicates that entries in this column should be unique within this column of this table.
[NOT] NULL: Species whether the column can have NULL values (the default), or whether every row must have an entry for this column.
REFERENCES <table name >: Specifies this the contents of this column are foreign keys reference the primary key of the table 'table name'. No values can be stored in this column unless they exist in the primary key column of the table 'table name'.

Specialised Primary Key Mechanisms

As mentioned before most database software allows the automatic generation of unique primary keys, however different software implementations do this in different ways, and there is no standard way to specify such behaviour in standard SQL.

In PostgreSQL a special column type is provided, called SERIAL. Making a column, not necessarily a primary key, of type SERIAL will assign a default value that starts at 1 and increments by 1 for every row inserted into the table that doesn't specify a value for that column.

CREATE TABLE my_table (
	id SERIAL PRIMARY KEY,
	somestuff TEXT
);

The same functionality can be achieved, but only for columns which are primary keys, in MySQL using the AUTO_INCREMENT qualifier (which is essentially a constraint).

CREATE TABLE my_table (
	id INTEGER AUTO_INCREMENT PRIMARY KEY,
	somestuff TEXT
);

Default Values

In addition to specifying column name, type and constraints, one may specify a default value for cases where rows are inserted without specifying the column in question. If no default is specified, the value becomes NULL. Defaults are specified with the in the form DEFAULT <value>, generally after any constraints.

Indexing

Finally, one can specify that an index be created for the table on the specified column, using the keyword INDEX. This is implied by the use of PRIMARY KEY, but other columns can be indexed for cases where they will be used often in queries. Indexing speeds up access and querying of the table when conditions using indexed columns are specified. The actual mechanisms of indexing and how they work are otherwise not worth going into in detail unless your intent is to become a professional database software developer.

Table Constraints

After all columns have been specified, additional table constraints can be specified. Most constraints can be specified using either column constraints or table constraints, but there are some constraints which are restricted by syntax to being specified only as table constraints. In particular, constraints that span multiple columns, as in the multi-column primary key constraints of link tables used to represent many-to-many relationships. Some of these constraints are:

CONSTRAINT <constraint name> PRIMARY KEY (<column name>[, <column name>[, ...]): Acts in the same manner as the PRIMARY KEY column constraint when a single column is specified, but also allows primary keys of multiple columns to be specified.
CONSTRAINT <constraint name> FOREIGN KEY (<column name>[, <column name>[, ...]) REFERENCES (<table name>): Acts in the same manner as the REFERENCES column constraint when a single column is specified, but an be used to specify a combination of columns that act in concert as a single foreign key referencing a table with a primary key composed of the same number of columns and the same respective column types.

As an aside, it should be mentioned that as of this writing, MySQL does not enforce column foreign key constraints, but does parse them and ignore them without warning or comment.

An Example Database Creation Script

Let's get to practical grips with SQL now. Save the following listing to a file called library.sql, and pipe it into you database client... psql < library.sql. If you're using MySQL you will need to change all instances of 'SERIAL' to 'INTEGER AUTO_INCREMENT' in library.sql and use this command instead: mysql -p < library.sql. This SQL script defines a new database to deal with the management of a small library, including searching for books by category, author name, or title, and managing their withdrawal and return by members.

CREATE TABLE Category (
	id SERIAL PRIMARY KEY,
	name VARCHAR(255) NOT NULL
);

CREATE TABLE Book (
	id SERIAL PRIMARY KEY,
	title VARCHAR(255) NOT NULL,
	category INTEGER REFERENCES Category NOT NULL
);

CREATE TABLE Author (
	id SERIAL PRIMARY KEY,
	surname VARCHAR(128) NOT NULL,
	firstname VARCHAR(128) NOT NULL
);

CREATE TABLE Authorship (
	author INTEGER REFERENCES Author NOT NULL,
	book INTEGER REFERENCES Book NOT NULL,
	PRIMARY KEY (author, book)
);

CREATE TABLE Member (
	id SERIAL PRIMARY KEY,
	surname VARCHAR(128) NOT NULL,
	firstname VARCHAR(128) NOT NULL,
	address TEXT,
	phone CHAR(12),
	fax CHAR(12),
	signup_date TIMESTAMP DEFAULT NOW()
);

CREATE TABLE Withdrawal (
	id SERIAL PRIMARY KEY,
	withdrawal TIMESTAMP NOT NULL,
	return TIMESTAMP NULL,
	book INTEGER REFERENCES Book NOT NULL,
	member INTEGER REFERENCES Member NOT NULL
);

Inserting Rows

Of course simply creating a data structure is not helpful unless we can populate it with data. Thus we come to the insertion of rows.

INSERT INTO <table name> 
	(<column name>[, <column name>[, ...]]) 
VALUES 
	(<value>[, <value>[, ...]]);

Very simply, the INSERT statement inserts a new rows into the table 'table name'. In the newly inserted row, the columns listed by name will be populated with the values listed, by respective position. String and text data needs to be enclosed in single quotes, integers, floats, numerics and boolean need not. Date and times also need to have single quotes and are always in 'YYYY-MM-DD' format for dates, and 'hh:mm:ss' for times, and 'YYYY-MM-DD hh:mm:ss' for timestamps.

Continuing with our library example, let's use the INSERT statement to populate out database with some information.

INSERT INTO Category (name) VALUES ('Fiction');
INSERT INTO Category (name) VALUES ('Non-Fiction');
INSERT INTO Category (name) VALUES ('Reference');

INSERT INTO Book (title, category) VALUES ('Introductory Programming in Python', 2);
INSERT INTO Book (title, category)
	VALUES ('Lesser Known Molluscs of the Karoo Flood Plain', 2);
INSERT INTO Book (title, category) VALUES ('Advanced Programming Techniques', 2);
INSERT INTO Book (title, category) VALUES ('PostgreSQL Reference Manual', 3);
INSERT INTO Book (title, category) VALUES ('Pride & Prejudice', 1);
INSERT INTO Book (title, category) VALUES ('The Time Machine', 1);
INSERT INTO Book (title, category) VALUES ('Day of the Triffids', 1);
INSERT INTO Book (title, category) VALUES ('The Bogus Book', 1);

INSERT INTO Author (surname, firstname) VALUES ('Dominy', 'James');
INSERT INTO Author (surname, firstname) VALUES ('Wells', 'Herbert George');
INSERT INTO Author (surname, firstname) VALUES ('Wyndham', 'John');
INSERT INTO Author (surname, firstname) VALUES ('Snailman', 'Fred B');
INSERT INTO Author (surname, firstname) VALUES ('Newman', 'Alfred E');
INSERT INTO Author (surname, firstname) VALUES ('Austen', 'Jane');

INSERT INTO Authorship (author, book) VALUES (1, 1);
INSERT INTO Authorship (author, book) VALUES (1, 2);
INSERT INTO Authorship (author, book) VALUES (5, 2);
INSERT INTO Authorship (author, book) VALUES (1, 3);
INSERT INTO Authorship (author, book) VALUES (5, 4);
INSERT INTO Authorship (author, book) VALUES (6, 5);
INSERT INTO Authorship (author, book) VALUES (2, 6);
INSERT INTO Authorship (author, book) VALUES (3, 7);

Updating Rows

We can modify the contents of rows that already exist in a table using the UPDATE statement.

UPDATE <table name>
	SET <column name> = <value>[, <column name> = <value>[,...]]
	[WHERE <condition>];

The UPDATE statements updates the contents of all rows in the specified table, unless a WHERE clause is given, in which case only rows that match the condition are updated. Usually the condition will be of the form WHERE prikey = <some id>, but more general updates can be useful. For example one might want to update all rows in a table that have a date column with values before 5 years ago, such that their 'archived' column becomes true, because they have cycled out of the active tax storage requirements.

By the way, in our example library database, one of the authorship records is wrong, so we should update it. James Dominy knows nothing about molluscs, and Fred Snailman is in fact the correct author.

UPDATE Authorship SET author = 4 WHERE author = 1 AND book = 2;

Deleting Rows

Obviously, over time we will need to delete records from out database. Again, simply done with the DELETE statement.

DELETE FROM <table name> [WHERE <condition>];

Note that the DELETE statement deletes all rows from the table unless a WHERE clause is provided, in which case only those rows matching the specified condition are removed. Rows cannot be removed if they are referred to by a foreign key in another table, i.e. removal is blocked if such a removal would cause a violation of the foreign key constraints of the database.

Let's get rid of the bogus book in our example library database.

DELETE FROM Book WHERE id = 8;

Querying Information

The true power behind SQL is the ability to dynamically specify what information we want from the database. This is done using the SELECT statement. The SELECT statement is possibly the most complex statement in the SQL language, so we'll deal with it by example, starting simple, and expanding from there.

Simple SELECT Statements

We'll start out with a simple single table query; What titles are available in our library?

SELECT title FROM Book;

As we can see the SELECT statement follows the form SELECT <column name> FROM <table name>;. But getting only the values from a single column is not particularly useful. What about data from multiple columns at once?

SELECT title, category FROM Book;

Okay, so we can list multiple columns from the table in comma separated list. But what if we don't want all the rows from the table? What if we only want to see non-fiction books?

SELECT title, category FROM Book WHERE category = 2;

Inner Joins

Great! But displaying the numbers for categories and having to use those numbers for our conditions is annoying. So how do we implement one of those infamous join thingies?

SELECT Book.title, Category.name
FROM Book
JOIN Category ON Category.id = Book.category
WHERE category = 2;

Reasonably obvious! After we've specified the main table to query from using FROM, we can specify which tables to join onto it with a JOIN clause. The JOIN clause specifies which table to join (Category in this case) and which row from the table being joined should be joined to the FROM table, using 'ON' and a condition (for each row in book find all rows in Category with an id equal to the category column in Book, combine the rows, add put them in the result table). Since we are now dealing with multiple tables in a single query we need to prefix column names with the table names they come from, hence 'Category.id' to distinguish it from 'Book.id'.

Performing a join creates a resultant table with all the columns from all the tables joined together. The SELECT statement then extracts only the columns of interest. So while the above query only shows two columns, the join itself produces a table with five columns.

Left Joins

So now we can join tables, but in doing so we exclude certain rows. How do we, for example, obtain a list of all categories and their books, even if there are no books in a category. First let's add such a category to the library database, then we'll perform the query.

INSERT INTO Category (name) VALUES ('Periodical');

SELECT Category.name, Book.title
FROM Category
LEFT JOIN Book ON Book.category = Category.id;

Aha! We can now see all categories, even the new one without any associated books. The results show a row with NULL in the place of the book title for the category 'Periodical'. This is because no row from book could be found to join to the 'Periodical' row from Category, but because we used a LEFT JOIN all rows already present in the left hand side of the join will be preserved and the right side of the join will be filled with NULL's where necessary.

Implicit Inner Joins using WHERE

We can also specify multiple tables in the FROM clause, and their respective join conditions in the WHERE clause. This always implies an INNER JOIN join though (i.e. a normal exclusive JOIN, and not a LEFT JOIN)

SELECT Book.title, Category.name
FROM Book, Category 
WHERE Category.id = Book.category AND category = 2;

Aggregation using GROUP BY

We can also display how many books are in each category.

SELECT Category.name, count(Book.id)
FROM Category
LEFT JOIN Book on Book.category = Category.id
GROUP BY Category.name;

Note the use of the function 'count'. SQL provides a number of aggregation functions which can used to calculate values over multiple rows of selected data. Aggregation functions are always used in conjunction with a GROUP BY clause. If no GROUP BY clause is specified, then the entire selection of rows is treated as a single group, otherwise rows with the same values in the specified column of the GROUP BY clause are grouped together. Each distinct value of the column (or distinct combination of values if multiple columns are specified) are treated as a group. Some commonly used aggregation functions are:

count(): Returns the number of rows in the group.
sum(): Returns the sum of the values of the specified column, which must obviously have a numeric (not NUMERIC) type.
avg(): Returns the average of the values of the specified column, which must obviously have a numeric type.

Sorting with ORDER BY

Finally, we can sort the results eventually displayed.

SELECT Category.name, count(Book.id) as number
FROM Category
LEFT JOIN Book on Book.category = Category.id
GROUP BY Category.name
ORDER BY Category.name;

And as can be seen below, we can specify sub-sorting and whether the sort should be ascending (the default) or descending (DESC)

SELECT Category.name, count(Book.id)
FROM Category
LEFT JOIN Book on Book.category = Category.id
GROUP BY Category.name, Book.title
ORDER BY Category.name DESC, Book.title;

Aliasing

If our table names are long, or we're just lazy, it can become a pain to type out full table names when differentiating column names. SQL allows us to create on the fly aliases for both table names and column names. Observe:

SELECT C.name as category, count(B.id) as number
FROM Category as C
LEFT JOIN Book as B on B.category = C.id
GROUP BY C.name
ORDER BY number;

Here we are dealing with two different types of aliases, although they both have the same syntax. We can, for the scope of the current query, alias either a column name or a table name by following it with as <alias>.

Aliasing column names affects the actual output of the query, in that the column headings printed out (or returned to your program if connecting to the db from python), which are generally automatically assigned to be the names of the columns chosen, are overridden by any alias names provided. As above, 'C.name' will now be named 'category' instead of 'name' in the output. Another use of aliasing is that it provides us with an easy way to reference aggregate columns in 'ORDER BY' or 'WHERE' clauses, as done here with 'number'.

Aliasing tables with 'as' in the FROM and JOIN clauses allows us to use the alias names to specify table differentiation of columns throughout the rest of query.

Dropping a table

We can get rid of an entire table, assuming there are no rows in the table which are still pointed to by foreign keys in other tables of course, by issuing the DROP TABLE <table name>; command. There's no undo, and asking for confirmation on this. Be sure!

Dropping an entire database

Similarly, we can drop an entire database. Again, no confirmation, no undo!

DROP DATABASE <database name>;

Transactions

One other cool thing about relational database management systems (RDBMS) is the concept of atomic transactions. Transactions allow one to group together SQL commands into a single atomic unit, known as a transaction. The commands are executed in sequence, but if one of the commands does not complete successfully, the database is rolled back to the point prior to the transaction starting, i.e. it's all or nothing. This is very useful when we want to insert multiple different rows into different tables that all refer to one another. If the connection goes dead mid insertion process, we don't want half the data there with foreign keys potentially unsatisfied. Instead we'd rather know that the whole thing failed, nothing was changed, and that if we redo from scratch it won't fail because unique keys already exists.

We start a transaction with BEGIN;, and everything entered up until END; becomes part of the transaction. Once the END command is issued, the entire transaction is executed at once. Actually, the commands are executed as you type them in, but if one causes an error, no more commands are valid until the 'END' of the transaction, the all commands will be automatically undone.