This lesson deals with standard structured query language (SQL). SQL is a not a programming language as such, but rather, as its name suggests, a query language. It is a language designed specifically to let one phrase a complex question in a human readable but also machine parsable manner, that can extract arbitrary collection of data from a given database.
The details of database administration are not part of the SQL standard, and as such not covered in this lesson (or this course for that matter). They are different for every implementation of SQL, and covered by each implementation's documentation. For the moment, it is assumed that you have an account on a database already set up. This lesson assumes a PostgreSQL database.
The first thing to do is start up your database client. In a
terminal, or command line window if using windows (Why are you still
using Windows?), type psql
for PostgreSQL:
sirlark@hephaestus ~ $ psql Welcome to psql 8.0.15, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help with psql commands \g or terminate with semicolon to execute query \q to quit sirlark=>
You are now connected to the PostgreSQL server, and are using the database named <your username> (in this case sirlark).
For MySQL type mysql -D <your username> -p
:
sirlark@hephaestus ~ $ mysql -D sirlark -p Enter password: Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 5 Server version: 5.0.60-log Gentoo Linux mysql-5.0.60-r1 Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql>
While the creation of a new database is nominally an administrative
issue, it is covered by the SQL standard. Quite simply we use the
command CREATE DATABASE <database name>;
. SQL
commands are not case sensitive, though table names and column names may
be depending on implementation, however it is common practice to use
uppercase for SQL specific keywords to make your queries more
legible. Note the trailing semicolon. SQL is insensitive to the amount
and nature of whitespace used, i.e. all whitespace is considered
equivalent. As such a newline does not end a command. Instead all
commands end is a semicolon.
The first thing we want to do with any relational database is to
specify its structure. We do this by creating tables, which initially
contain no rows. The postgres client allows us to list all the tables
present in the database using the command \dt
(The MySQL
equivalent is SHOW TABLES
).
Creating a new table is as simple as:
CREATE TABLE <table name> ( <column name> <type> [<constraints>] [DEFAULT <default value>] [, ... ] );
Like python, relational databases have type, except they enforce the type of data present in a column of a table. Different implementations provide various different types, but the SQL standard specifies a shared collection of types available in all implementations.
There are many more standard SQL types, which you can find in the SQL standard, or in your database software documentation. The above types however will cover most of your bases.
Although every column we specify must have a type, constraints are optional. Constraints are rules we specify that constrain not the type of data that may be stored in a column, but the content of that data. Constraints can be specified in any order. Some common constraint keywords are:
As mentioned before most database software allows the automatic generation of unique primary keys, however different software implementations do this in different ways, and there is no standard way to specify such behaviour in standard SQL.
In PostgreSQL a special column type is provided, called SERIAL. Making a column, not necessarily a primary key, of type SERIAL will assign a default value that starts at 1 and increments by 1 for every row inserted into the table that doesn't specify a value for that column.
CREATE TABLE my_table ( id SERIAL PRIMARY KEY, somestuff TEXT );
The same functionality can be achieved, but only for columns which are primary keys, in MySQL using the AUTO_INCREMENT qualifier (which is essentially a constraint).
CREATE TABLE my_table ( id INTEGER AUTO_INCREMENT PRIMARY KEY, somestuff TEXT );
In addition to specifying column name, type and constraints, one may
specify a default value for cases where rows are inserted without
specifying the column in question. If no default is specified, the value
becomes NULL. Defaults are specified with the in the form DEFAULT
<value>
, generally after any constraints.
Finally, one can specify that an index be created for the table on
the specified column, using the keyword INDEX
. This is
implied by the use of PRIMARY KEY, but other columns can be
indexed for cases where they will be used often in queries. Indexing
speeds up access and querying of the table when conditions using indexed
columns are specified. The actual mechanisms of indexing and how they
work are otherwise not worth going into in detail unless your intent is
to become a professional database software developer.
After all columns have been specified, additional table constraints can be specified. Most constraints can be specified using either column constraints or table constraints, but there are some constraints which are restricted by syntax to being specified only as table constraints. In particular, constraints that span multiple columns, as in the multi-column primary key constraints of link tables used to represent many-to-many relationships. Some of these constraints are:
As an aside, it should be mentioned that as of this writing, MySQL does not enforce column foreign key constraints, but does parse them and ignore them without warning or comment.
Let's get to practical grips with SQL now. Save the following
listing to a file called library.sql, and pipe it into you database
client... psql < library.sql
. If you're using MySQL
you will need to change all instances of 'SERIAL' to 'INTEGER
AUTO_INCREMENT' in library.sql and use this command instead:
mysql -p < library.sql
. This SQL script defines a new
database to deal with the management of a small library, including
searching for books by category, author name, or title, and managing
their withdrawal and return by members.
CREATE TABLE Category ( id SERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL ); CREATE TABLE Book ( id SERIAL PRIMARY KEY, title VARCHAR(255) NOT NULL, category INTEGER REFERENCES Category NOT NULL ); CREATE TABLE Author ( id SERIAL PRIMARY KEY, surname VARCHAR(128) NOT NULL, firstname VARCHAR(128) NOT NULL ); CREATE TABLE Authorship ( author INTEGER REFERENCES Author NOT NULL, book INTEGER REFERENCES Book NOT NULL, PRIMARY KEY (author, book) ); CREATE TABLE Member ( id SERIAL PRIMARY KEY, surname VARCHAR(128) NOT NULL, firstname VARCHAR(128) NOT NULL, address TEXT, phone CHAR(12), fax CHAR(12), signup_date TIMESTAMP DEFAULT NOW() ); CREATE TABLE Withdrawal ( id SERIAL PRIMARY KEY, withdrawal TIMESTAMP NOT NULL, return TIMESTAMP NULL, book INTEGER REFERENCES Book NOT NULL, member INTEGER REFERENCES Member NOT NULL );
Of course simply creating a data structure is not helpful unless we can populate it with data. Thus we come to the insertion of rows.
INSERT INTO <table name> (<column name>[, <column name>[, ...]]) VALUES (<value>[, <value>[, ...]]);
Very simply, the INSERT statement inserts a new rows into the table 'table name'. In the newly inserted row, the columns listed by name will be populated with the values listed, by respective position. String and text data needs to be enclosed in single quotes, integers, floats, numerics and boolean need not. Date and times also need to have single quotes and are always in 'YYYY-MM-DD' format for dates, and 'hh:mm:ss' for times, and 'YYYY-MM-DD hh:mm:ss' for timestamps.
Continuing with our library example, let's use the INSERT statement to populate out database with some information.
INSERT INTO Category (name) VALUES ('Fiction'); INSERT INTO Category (name) VALUES ('Non-Fiction'); INSERT INTO Category (name) VALUES ('Reference'); INSERT INTO Book (title, category) VALUES ('Introductory Programming in Python', 2); INSERT INTO Book (title, category) VALUES ('Lesser Known Molluscs of the Karoo Flood Plain', 2); INSERT INTO Book (title, category) VALUES ('Advanced Programming Techniques', 2); INSERT INTO Book (title, category) VALUES ('PostgreSQL Reference Manual', 3); INSERT INTO Book (title, category) VALUES ('Pride & Prejudice', 1); INSERT INTO Book (title, category) VALUES ('The Time Machine', 1); INSERT INTO Book (title, category) VALUES ('Day of the Triffids', 1); INSERT INTO Book (title, category) VALUES ('The Bogus Book', 1); INSERT INTO Author (surname, firstname) VALUES ('Dominy', 'James'); INSERT INTO Author (surname, firstname) VALUES ('Wells', 'Herbert George'); INSERT INTO Author (surname, firstname) VALUES ('Wyndham', 'John'); INSERT INTO Author (surname, firstname) VALUES ('Snailman', 'Fred B'); INSERT INTO Author (surname, firstname) VALUES ('Newman', 'Alfred E'); INSERT INTO Author (surname, firstname) VALUES ('Austen', 'Jane'); INSERT INTO Authorship (author, book) VALUES (1, 1); INSERT INTO Authorship (author, book) VALUES (1, 2); INSERT INTO Authorship (author, book) VALUES (5, 2); INSERT INTO Authorship (author, book) VALUES (1, 3); INSERT INTO Authorship (author, book) VALUES (5, 4); INSERT INTO Authorship (author, book) VALUES (6, 5); INSERT INTO Authorship (author, book) VALUES (2, 6); INSERT INTO Authorship (author, book) VALUES (3, 7);
We can modify the contents of rows that already exist in a table using the UPDATE statement.
UPDATE <table name> SET <column name> = <value>[, <column name> = <value>[,...]] [WHERE <condition>];
The UPDATE statements updates the contents of
all rows in the specified table, unless a WHERE clause
is given, in which case only rows that match the condition are updated.
Usually the condition will be of the form WHERE prikey = <some
id>
, but more general updates can be useful. For example one
might want to update all rows in a table that have a date column with
values before 5 years ago, such that their 'archived' column becomes
true, because they have cycled out of the active tax storage
requirements.
By the way, in our example library database, one of the authorship records is wrong, so we should update it. James Dominy knows nothing about molluscs, and Fred Snailman is in fact the correct author.
UPDATE Authorship SET author = 4 WHERE author = 1 AND book = 2;
Obviously, over time we will need to delete records from out database. Again, simply done with the DELETE statement.
DELETE FROM <table name> [WHERE <condition>];
Note that the DELETE statement deletes all rows from the table unless a WHERE clause is provided, in which case only those rows matching the specified condition are removed. Rows cannot be removed if they are referred to by a foreign key in another table, i.e. removal is blocked if such a removal would cause a violation of the foreign key constraints of the database.
Let's get rid of the bogus book in our example library database.
DELETE FROM Book WHERE id = 8;
The true power behind SQL is the ability to dynamically specify what information we want from the database. This is done using the SELECT statement. The SELECT statement is possibly the most complex statement in the SQL language, so we'll deal with it by example, starting simple, and expanding from there.
We'll start out with a simple single table query; What titles are available in our library?
SELECT title FROM Book;
As we can see the SELECT statement follows the form
SELECT <column name> FROM <table name>;
. But
getting only the values from a single column is not particularly useful.
What about data from multiple columns at once?
SELECT title, category FROM Book;
Okay, so we can list multiple columns from the table in comma separated list. But what if we don't want all the rows from the table? What if we only want to see non-fiction books?
SELECT title, category FROM Book WHERE category = 2;
Great! But displaying the numbers for categories and having to use those numbers for our conditions is annoying. So how do we implement one of those infamous join thingies?
SELECT Book.title, Category.name FROM Book JOIN Category ON Category.id = Book.category WHERE category = 2;
Reasonably obvious! After we've specified the main table to query from using FROM, we can specify which tables to join onto it with a JOIN clause. The JOIN clause specifies which table to join (Category in this case) and which row from the table being joined should be joined to the FROM table, using 'ON' and a condition (for each row in book find all rows in Category with an id equal to the category column in Book, combine the rows, add put them in the result table). Since we are now dealing with multiple tables in a single query we need to prefix column names with the table names they come from, hence 'Category.id' to distinguish it from 'Book.id'.
Performing a join creates a resultant table with all the columns from all the tables joined together. The SELECT statement then extracts only the columns of interest. So while the above query only shows two columns, the join itself produces a table with five columns.
So now we can join tables, but in doing so we exclude certain rows. How do we, for example, obtain a list of all categories and their books, even if there are no books in a category. First let's add such a category to the library database, then we'll perform the query.
INSERT INTO Category (name) VALUES ('Periodical'); SELECT Category.name, Book.title FROM Category LEFT JOIN Book ON Book.category = Category.id;
Aha! We can now see all categories, even the new one without any associated books. The results show a row with NULL in the place of the book title for the category 'Periodical'. This is because no row from book could be found to join to the 'Periodical' row from Category, but because we used a LEFT JOIN all rows already present in the left hand side of the join will be preserved and the right side of the join will be filled with NULL's where necessary.
We can also specify multiple tables in the FROM clause, and their respective join conditions in the WHERE clause. This always implies an INNER JOIN join though (i.e. a normal exclusive JOIN, and not a LEFT JOIN)
SELECT Book.title, Category.name FROM Book, Category WHERE Category.id = Book.category AND category = 2;
We can also display how many books are in each category.
SELECT Category.name, count(Book.id) FROM Category LEFT JOIN Book on Book.category = Category.id GROUP BY Category.name;
Note the use of the function 'count'. SQL provides a number of aggregation functions which can used to calculate values over multiple rows of selected data. Aggregation functions are always used in conjunction with a GROUP BY clause. If no GROUP BY clause is specified, then the entire selection of rows is treated as a single group, otherwise rows with the same values in the specified column of the GROUP BY clause are grouped together. Each distinct value of the column (or distinct combination of values if multiple columns are specified) are treated as a group. Some commonly used aggregation functions are:
Finally, we can sort the results eventually displayed.
SELECT Category.name, count(Book.id) as number FROM Category LEFT JOIN Book on Book.category = Category.id GROUP BY Category.name ORDER BY Category.name;
And as can be seen below, we can specify sub-sorting and whether the sort should be ascending (the default) or descending (DESC)
SELECT Category.name, count(Book.id) FROM Category LEFT JOIN Book on Book.category = Category.id GROUP BY Category.name, Book.title ORDER BY Category.name DESC, Book.title;
If our table names are long, or we're just lazy, it can become a pain to type out full table names when differentiating column names. SQL allows us to create on the fly aliases for both table names and column names. Observe:
SELECT C.name as category, count(B.id) as number FROM Category as C LEFT JOIN Book as B on B.category = C.id GROUP BY C.name ORDER BY number;
Here we are dealing with two different types of aliases, although
they both have the same syntax. We can, for the scope of the current
query, alias either a column name or a table name by following it with
as <alias>
.
Aliasing column names affects the actual output of the query, in that the column headings printed out (or returned to your program if connecting to the db from python), which are generally automatically assigned to be the names of the columns chosen, are overridden by any alias names provided. As above, 'C.name' will now be named 'category' instead of 'name' in the output. Another use of aliasing is that it provides us with an easy way to reference aggregate columns in 'ORDER BY' or 'WHERE' clauses, as done here with 'number'.
Aliasing tables with 'as' in the FROM and JOIN clauses allows us to use the alias names to specify table differentiation of columns throughout the rest of query.
We can get rid of an entire table, assuming there are no rows in the
table which are still pointed to by foreign keys in other tables of
course, by issuing the DROP TABLE <table name>;
command. There's no undo, and asking for confirmation on this. Be
sure!
Similarly, we can drop an entire database. Again, no confirmation, no undo!
DROP DATABASE <database name>;
One other cool thing about relational database management systems (RDBMS) is the concept of atomic transactions. Transactions allow one to group together SQL commands into a single atomic unit, known as a transaction. The commands are executed in sequence, but if one of the commands does not complete successfully, the database is rolled back to the point prior to the transaction starting, i.e. it's all or nothing. This is very useful when we want to insert multiple different rows into different tables that all refer to one another. If the connection goes dead mid insertion process, we don't want half the data there with foreign keys potentially unsatisfied. Instead we'd rather know that the whole thing failed, nothing was changed, and that if we redo from scratch it won't fail because unique keys already exists.
We start a transaction with BEGIN;
, and everything
entered up until END;
becomes part of the transaction. Once
the END command is issued, the entire transaction is executed
at once. Actually, the commands are executed as you type them in, but if
one causes an error, no more commands are valid until the 'END' of the
transaction, the all commands will be automatically undone.