University of London / MSc Computer Science: Data management(前半)

University of London / MSc Computer Science: Data management(前半)

August 13, 2023

ロンドン大学で MSc Computer Science: Data management モジュールを履修中。

講義内容に関して記録した個人的なスタディノートです。

全 12 週のうち 1〜5 週目の内容を記録します。(1 週目開始:2023 年 7 月 10 日 / 5 週目終了:2023 年 8 月 13 日)

モジュール概要 #

Content

This module covers the principles and application of data management technologies and languages including SQL. Students study the use of these in leading commercial database management systems as well as emerging approaches to data management. The module also examines the technologies underlying modern data management systems. It studies advanced aspects of query processing, transaction management, distributed data management, and recent developments in web data, big data and alternative database architectures. Data Management represents a core technology in the theory and application of Computer Science.

Aims

The primary aims of this module are:

  • to study the principles and application of data management technology, and
  • to study advanced aspects of databases and recent advances in data management technologies in three major directions: performance, distribution of data, and heterogeneity of data.

Weeks

講義は Week 10 まで。Week 11, 12 は最終課題を作成する期間。

  • Week 1: Introduction to Database Management Systems (DBMS) and the Relational Model
  • Week 2: Relational Algebra
  • Week 3: Introduction to querying using SQL
  • Week 4: SQL subqueries, operators and functions
  • Week 5: SQL Group By queries, NULL values
  • Week 6: Host Language support in SQL and Transaction Management 1 – Recovery and Logging
  • Week 7: Transaction Management 2 – Concurrency
  • Week 8: Database design, Normalisation and SQL Data Definition Language
  • Week 9: DBMS architectures, implementations and Query Optimisation
  • Week 10: Enhanced database capabilities, non-relational DBMS and Distributed Databases

参考文書 #

Essential reading

  • Ricardo, C.M., S.D. Urban and an O’Reilly Media Company Safari Databases illuminated. (Jones & Bartlett Learning, 2015) 3rd edition.
  • Date, C.J. and an O’Reilly Media Company Safari Database in depth: relational theory for practitioners. (O’Reilly Media, Inc., 2005) 1st edition.
  • Foster, E.C., S.V. Godbole and an O’Reilly Media Company Safari Database systems: a pragmatic approach. (Apress, 2014) 1st edition.
  • Beaulieu, A. and an O’Reilly Media Company Safari Learning SQL: generate, manipulate, and retrieve data. (O’Reilly Media, Inc., 2005) 1st edition.
  • Singh, S.K. and an O’Reilly Media Company Safari Database systems: concepts, designs and application. (Pearson India, 2009) 1st edition.
  • Atzeni, P., C.S. Jensen, G. Orsi, S. Ram, L. Tanca and R. Torlone ‘The relational model is dead, SQL is dead, and I don’t feel so good myself’, SIGMOD Record 42(2) 2013, pp.64–68.

Week 1: Introduction to Database Management Systems (DBMS) and the Relational Model #

The history of databases and DBMS #

History of databases #

The first computer-based databases came into existence in the 1960s, initially using hierarchical model (IMS, Information Management System) and network model (Codasyl, Conference on data systems languages).

The relational model, which underpins many of today’s most popular database management systems (DBMS) (e.g. Oracle, IBM DB2, Microsoft SQLServer, MySQL, PostgreSQL) was proposed by Edgar F. Codd in 1970.

However, commercial systems using the relational model did not become prevalent until the 1980s, when computing hardware that was powerful enough became available.

Although relational DBMS continues to dominate the market, from the late 2000s a new generation of non-relational databases (sometimes known as NoSQL or NewSQL DBMS) began to begin to compete with relational DBMS.

A database system is:

‘… a computerized system whose overall purpose is to store information and to allow users to retrieve and update that information on demand.’

Date, C.J. “An introduction to database systems.” (Pearson, 2003)

参考

Approaches to database modelling #

  • Relational model
  • Entity-Relationship (EAR or ER) model
  • Hierarchical
  • CODASYL
  • Network
  • Object-oriented

The Relational Model #

(省略。主キーや外部キー、NULL 値などについての講義。)

Week 2: Relational Algebra #

Introduction to Relational Algebra #

現在、言及されることが多い関係代数の演算子としては、和、差、交差、直積、選択(制限)、射影、結合、商の 8 種類

Manipulating database is considered using an algebraic approach. Relational algebra consists of two groups of operators.

  1. Traditional set operators.
    1. UNION(和)
    2. DIFFERENCE(差)
    3. INTERSECTION(交差)
    4. CARTESIAN PRODUCT(直積)
  2. Relational special operators. These are specific to relational algebra.
    1. SELECTION or RESTRICTION(選択または制限)
    2. PROJECTION(射影)
    3. JOIN(結合)
    4. DIVISION(商)

One thing to emphasize is that Select Statement of the Structured Query Language is not the same as a relational algebra selection.

Relational Algebra notations #

There is no standard syntax for the relational algebra operators. In this module, the convention below are used.

OPERATOR          : SYNTAX
---------------------------------------
UNION             : A ∪ B
INTERSECTION      : A ∩ B
DIFFERENCE        : A - B
CARTESIAN PRODUCT : A × B
SELECTION         : σ (condition)(A)
PROJECTION        : Π attribute-list(A)
NATURAL JOIN      : A ⋈ B
DIVISION          : A ÷ B

Relational Algebra operators #

UNION

# Table: BLUE_PART
P#  PNAME
---------
P3  SCREW
P5  CAM

# Table: PARIS_PART
P#  PNAME
---------
P2  BOLT
P5  CAM
R1 := BLUE_PART ∪ PARIS_PART

# Result in the relation R1
P#  PNAME
---------
P3  SCREW
P5  CAM   # NOTE: The duplicate tuples is eliminated.
P2  BOLT

INTERSECTION

# Table: BLUE_PART
P#  PNAME
---------
P3  SCREW
P5  CAM

# Table: PARIS_PART
P#  PNAME
---------
P2  BOLT
P5  CAM
R2 := BLUE_PART ∩ PARIS_PART

# Result in the relation R2
P#  PNAME
---------
P5  CAM

DIFFERENCE

# Table: BLUE_PART
P#  PNAME
---------
P3  SCREW
P5  CAM

# Table: PARIS_PART
P#  PNAME
---------
P2  BOLT
P5  CAM
R3 := BLUE_PART - PARIS_PART

# Result in the relation R3
P#  PNAME
---------
P3  SCREW

CARTESIAN PRODUCT

# Table: BLUE_PART
P#  PNAME
---------
P3  SCREW
P5  CAM

# Table: ALL_SUPPLIER
S#  SNAME
---------
S1  SMITH
S2  JONES
S3  BLAKE
R4 := BLUE_PART × ALL_SUPPLIER

# Result in the relation R4
P#  PNAME  S#  SNAME
--------------------
P3  SCREW  S1  SMITH
P5  CAM    S1  SMITH
P3  SCREW  S2  JONES
P5  CAM    S2  JONES
P3  SCREW  S3  BLAKE
P5  CAM    S3  BLAKE

SELECTION

SELECTION is a horizontal subset.

σ (CITY = 'PARIS')(SUPPLIER)

S#  SNAME  STATUS  CITY
------------------------
S2  JONES  10      PARIS
S3  BLAKE  30      PARIS



σ (STATUS > 15)(SUPPLIER)

S#  SNAME  STATUS  CITY
-------------------------
S1  JONES  20      LONDON
S3  BLAKE  30      PARIS



σ (CITY = 'PARIS' ∧ STATUS < 15)(SUPPLIER)

S#  SNAME  STATUS  CITY
------------------------
S2  JONES  10      PARIS

PROJECTION

PROJECTION is a vertical subset.

Π CITY(SUPPLIER)

CITY
------
LONDON
PARIS



Π CITY,COLOR(PART)

CITY    COLOR
-------------
LONDON  RED
PARIS   GREEN
ROME    BLUE
PARIS   BLUE

JOIN

Whereas the result of a Cartesian product has all combinations of tuples from the operand relations, the result of a join has only certain combinations of tuples from the operand relations.

# Table: SMALL_S
S#  SNAME
---------
S1  SMITH
S2  JONES

# Table: SMALL_SP
S#  P#
------
S2  P1
S3  P2
# NATURAL JOIN(⋈)of relation A and relation B is defined as an equal join of
# A and B, with equality holding for each attribute found in both A and B,
# followed by a projection to remove the duplicate attributes arising from B.

SMALL_S ⋈ SMALL_SP

# Results
S#  SNAME  P#
-------------
S2  JONES  P1

DIVISION

# Table: SUPPLY2
S#  P#  J#
----------
S1  P1  J1
S1  P1  J2
S1  P1  J3
S2  P3  J1
S2  P3  J2
S2  P4  J1
S2  P4  J3
S3  P4  J1

# Table: PROJECT2
J#
--
J1
J2
SUPPLY2 ÷ PROJECT2

# -> PROJECT2のすべての値(= J1 および J2)を持っている S# P# の組み合わせを返す

# Results
S#  P#
------
S1  P1
S2  P3

NESTING of relational algebra expressions

Π SNAME(σ(J# = 'J4')(SUPPLIER ⋈ SUPPLY))

# 👆 means
TEMP1  := SUPPLIER ⋈ SUPPLY
TEMP2  := σ(J# = 'J4')(TEMP1)
RESULT := Π SNAME(TEMP2)
### Other examples ###

# Find the colors of parts supplied by supplier S5 to project J4.
Π COLOR((σ(S# = 'S5' ∧ J# = 'J4')(SUPPLY)) ⋈ PART)

# Find the names of parts supplied by Smith.
Π PNAME((Π P#(σ(SNAME = 'SMITH')(SUPPLIER) ⋈ SUPPLY)) ⋈ PART)

Practice: Relational Algebra #

Consider a database with the following relations:

  • Country (CoName)
  • EU (CoName)
  • NATO (CoName)
  • Capital (CoName, City, Population)
  • Area (CoName, Size)
  • Language (CoName, Lang)
  • Border (CoName1, CoName2)

Relation ‘Country’ records the name of each country, whilst relations ‘EU’ and ‘NATO’ record the names of ‘EU’ and ‘NATO’ countries respectively. The capital (and its population), size in square miles, and languages spoken are also recorded for each country. Relation ‘Border’ records the countries bordering each country, the information being held symmetrically in the sense that if the relation contains a tuple such as (France, Germany) it also holds the tuple (Germany, France).

Write relational algebra expressions to find:

  • (a) the language which are spoken,
  • (b) the countries which are either in NATO or the EU,
  • (c) the countries which are both in NATO or the EU,
  • (d) the countries which are in NATO but not the EU,
  • (e) the countries which has an its capital ‘Vaduz’,
  • (f) the capitals of countries with an area of less than 4000 square miles,

Answers

(a)
Π Lang(Language)

(b)
NATO ∪ EU

(c)
NATO ∩ EU

(d)
NATO - EU

(e)
Π CoName(σ(CITY = 'Vaduz')(Capital))

(f)
Π City((σ(Size < 4000)(Area)) ⋈ Capital)

Week 3: Introduction to querying using SQL #

What is SQL and a brief history of SQL #

Structured Query Language (SQL) History

  • Database language developed by IBM in the 1970s.
  • Relational Database Management Systems (DBMS) products supporting SQL first appeared in the late 1970s and early 1980s e.g. Oracle, DB2.
  • Other suppliers subsequently developed their own SQL-based relational DBMS products e.g. Microsoft SQL Server.
  • Open Source platforms such as MySQL and PostgreSQL supporting SQL.
  • The American National Standards Institute, ANSI, adopted the first version of the standard in 1986 and the International Organization for Standardization, ISO, in 1987.
  • Then standard has evolved regularly.

What is SQL

SQL (often pronounced SEQUEL) is a special-purpose programming language designed for manipulating data in a relational database management system.

SQL is combined data manipulation language(DML) and data definition language(DDL).

  • DML: is used to retrieve and modify data in relational tables.
  • DDL: is used to create and modify table structures, views, and so on.

SQL: Select - Single Table queries #

The basic format of the ‘SELECT’ statement is:

SELECT [ALL | DISTINCT] <select-list>
FROM <table-list>
[ WHERE <conditional-expression> ]
[ GROUP BY <column-list> ]
[ HAVING <conditional-expression> ]
[ ORDER BY <order-list> ]

If we are in strict relational algebra sense, when duplicated set is found in selected data, it would have eliminated because we can’t have duplicates in a set. SQL is implemented as what we call a bag language (a bag is a set that allows duplicates) as opposed to a set language. A bag essentially has all the same properties as a set except that duplicates are allowed. In SQL duplicates rows are not eliminated by default. If you need it, specify ‘DISTINCT’ qualifier.

SQL: Select - Multiple Table queries #

-- Fond supplier names for suppliers show supply part P2

-- Alternative A
SELECT DISTINCT SNAME
FROM SUPPLIER, SUPPLY
WHERE SUPPLIER.SIP = SUPPLY.SID
AND SUPPLY.PID = 'P2';

-- Alternative B: use (uncorrelated) sub-query
SELECT SNAME
FROM SUPPLIER
WHERE SID IN (SELECT SID
              FROM SUPPLY
              WHERE PID = 'P2');

-- Alternative C: use correlated sub-query
SELECT SNAME
FROM SUPPLIER
WHERE 'P2' IN (SELECT PID
               FROM SUPPLY
               WHERE SID = SUPPLIER.SID); -- referencing from the outer query

-- 👆 Alternative B's sub-query runs once
-- 👆 Alternative C's correlated sub-query runs for each row of the outer query.
-- Find supplier names for suppliers who supply a red part.

-- Alternative A: nested
SELECT SNAME
FROM SUPPLIER
WHERE SID IN
   (SELECT SID
    FROM SUPPLY
    WHERE PID IN
      (SELECT PID
       FROM PART
       WHERE COLOR = 'RED'));

-- Alternative B: non-nested
SELECT DISTINCT SNAME
FROM SUPPLIER, SUPPLY, PART
WHERE SUPPLIER.SID = SUPPLY.SID
AND PART.PID = SUPPLY.PID
AND PART.COLOR = 'RED';

CRUD #

The CRUD acronym is represented in the following table:

  • Create: equals the INSERT SQL keyword
    • INSERT for inserting data into the specified table
  • Read: equals the SELECT SQL keyword
    • SELECT for getting data from the specified table
  • Update: equals the UPDATE SQL keyword
    • UPDATE for updating existing data in the specified table
  • Delete: equals the DELETE SQL keyword
    • DELETE for deleting data from the specified table

DDL & DML #

It is important not to confuse the DDL keywords with the DML keywords.

Data Definition Language (DDL)

The keywords that let us CREATE, DROP (remove), SHOW, ALTER and in general, manipulate databases and tables are commonly associated with the DDL term.

Data Manipulation Language (DML)

DML applies only to the group of 4 keywords (the CRUD keywords) that directly manipulate database data.

For example, in the SQL language, the SELECT, INSERT, UPDATE and DELETE keywords, and its correspondent clauses and operators.

Practice: DDL/DML with MySQL #

mysql

CREATE DATABASE famous_scientists;

SHOW DATABASES;

/*
+-------------------+
| Database          |
+-------------------+
| famous_scientists |
+-------------------+
*/

USE famous_scientists;

CREATE TABLE scientists(
   id INT(1) NOT NULL AUTO_INCREMENT,
   name VARCHAR(255) NOT NULL,
   discovery TEXT NOT NULL,
   year_of_birth INT(4) NOT NULL,
   year_of_death INT(4) DEFAULT NULL,
   PRIMARY KEY(id)
) AUTO_INCREMENT = 1;

SHOW TABLES;

/*
+-----------------------------+
| Tables_in_famous_scientists |
+-----------------------------+
| scientists                  |
+-----------------------------+
*/

SHOW COLUMNS FROM scientists;

/*
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| id            | int(1)       | NO   | PRI | NULL    | auto_increment |
| name          | varchar(255) | NO   |     | NULL    |                |
| discovery     | text         | NO   |     | NULL    |                |
| year_of_birth | int(4)       | NO   |     | NULL    |                |
| year_of_death | int(4)       | YES  |     | NULL    |                |
+---------------+--------------+------+-----+---------+----------------+
*/

INSERT INTO scientists (name, discovery, year_of_birth, year_of_death)
VALUES
("Galileo Galilei", "Modern observational astronomy", 1564, 1642),
("Geoffrey Hinton", "Artificial neural network", 1947, default);

SELECT * FROM scientists;

/*
+----+-----------------+--------------------------------+---------------+---------------+
| id | name            | discovery                      | year_of_birth | year_of_death |
+----+-----------------+--------------------------------+---------------+---------------+
|  1 | Galileo Galilei | Modern observational astronomy |          1564 |          1642 |
|  2 | Geoffrey Hinton | Artificial neural network      |          1947 |          NULL |
+----+-----------------+--------------------------------+---------------+---------------+
*/

Week 4: SQL subqueries, operators and functions #

Subqueries and correlation names #

/*
The correlation name SPX is used to unambiguously qualify column references
in the SELECT statement, when the same table is used more than once.

Once a correlation name has been introduced,
the table name itself may not be used within the SELECT statement
as a qualifier to reference a column in the aliased table.
*/

-- Find part ids for parts supplied by more than one supplier.

SELECT DISTINCT PID
FROM SUPPLY SPX
WHERE PID IN (SELECT PID FROM SUPPLY
              WHERE SID <> SPX.SID);

SELECT DISTINCT PID
FROM SUPPLY SPX
WHERE EXISTS (SELECT * FROM SUPPLY -- or SELECT 1 FROM SUPPLY
              WHERE SID <> SPX.SID);

SELECT DISTINCT SPX.PID
FROM SUPPLY SPX, SUPPLY SPY
WHERE SPY.PID = PSX.PID
AND SPY.SID <> SPX.SID;

The MINUS operator in SQL #

-- Find the project ids and names for projects not supplied by supplier S4.

SELECt * FROM PROJECT;
/*
+-----+---------+--------+
| JID | JNAME   | CITY   |
+-----+---------+--------+
| J1  | SORTER  | PARIS  |
| J2  | DISPLAY | ROME   |
| J3  | OCR     | ATHENS |
| J4  | CONSOLE | ATHENS |
| J5  | RAID    | LONDON |
| J6  | EDS     | OSLO   |
| J7  | TAPE    | LONDON |
+-----+---------+--------+
*/

SELECt * FROM SUPPLY;
/*
+-----+-----+-----+----------+
| SID | PID | JID | QUANTITY |
+-----+-----+-----+----------+
| S1  | P1  | J1  |      200 |
| S1  | P1  | J4  |      700 |
| S2  | P3  | J1  |      400 |
| S2  | P3  | J2  |      200 |
| S2  | P3  | J3  |      200 |
| S2  | P3  | J4  |      500 |
| S2  | P3  | J5  |      600 |
| S2  | P3  | J6  |      400 |
| S2  | P3  | J7  |      800 |
| S2  | P5  | J2  |      100 |
| S3  | P3  | J1  |      200 |
| S3  | P4  | J2  |      500 |
| S4  | P6  | J3  |      300 |
| S4  | P6  | J7  |      300 |
| S5  | P1  | J4  |      100 |
| S5  | P2  | J2  |      200 |
| S5  | P2  | J4  |      100 |
| S5  | P3  | J4  |      200 |
| S5  | P4  | J4  |      800 |
| S5  | P5  | J4  |      400 |
| S5  | P5  | J5  |      500 |
| S5  | P5  | J7  |      100 |
| S5  | P6  | J2  |      200 |
| S5  | P6  | J4  |      500 |
+-----+-----+-----+----------+
*/

/*
Expected result:
J3 and J7 is not included.

+-----+---------+
| JID | JNAME   |
+-----+---------+
| J1  | SORTER  |
| J2  | DISPLAY |
| J4  | CONSOLE |
| J5  | RAID    |
| J6  | EDS     |
+-----+---------+
*/
-- [common INCORRECT approach!!]
-- This gives projects that are supplied by a supplier other than S4,
-- but such a project could also be supplied by supplier S4.
-- This query therefore incorrectly gives all projects.
SELECT DISTINCT PROJECT.JID, PROJECT.JNAME
FROM PROJECT, SUPPLY
WHERE PROJECT.JID = SUPPLY.JID
AND SUPPLY.SID <> 'S4';
/*
+-----+---------+
| JID | JNAME   |
+-----+---------+
| J1  | SORTER  |
| J2  | DISPLAY |
| J3  | OCR     |
| J4  | CONSOLE |
| J5  | RAID    |
| J6  | EDS     |
| J7  | TAPE    |
+-----+---------+
*/



-- CORRECT approach 1 (NOT IN):
SELECT DISTINCT PROJECT.JID, PROJECT.JNAME
FROM PROJECT
WHERE JID NOT IN (SELECT JID
                  FROM SUPPLY
                  WHERE SID = 'S4');

-- CORRECT approach 2 (NOT EXISTS):
SELECT DISTINCT PROJECT.JID, PROJECT.JNAME
FROM PROJECT
WHERE NOT EXISTS (SELECT *
                  FROM SUPPLY
                  WHERE SUPPLY.JID = PROJECT.JID AND SID = 'S4');

-- CORRECT approach 3 (EXCEPT):
-- ⚠️ EXCEPT is a part of the SQL standard but Oracle doesn't support EXCEPT operator.
SELECT DISTINCT JID, JNAME
FROM PROJECT
EXCEPT
SELECT PROJECT.JID, PROJECT.JNAME
FROM PROJECT, SUPPLY
WHERE SUPPLY.JID = PROJECT.JID AND SID = 'S4';

The INTERSECTION operator in SQL #

-- Find part ids for parts supplied by supplier S2 and S3.

SELECT * FROM SUPPLY;
/*
+-----+-----+-----+----------+
| SID | PID | JID | QUANTITY |
+-----+-----+-----+----------+
| S1  | P1  | J1  |      200 |
| S1  | P1  | J4  |      700 |
| S2  | P3  | J1  |      400 |
| S2  | P3  | J2  |      200 |
| S2  | P3  | J3  |      200 |
| S2  | P3  | J4  |      500 |
| S2  | P3  | J5  |      600 |
| S2  | P3  | J6  |      400 |
| S2  | P3  | J7  |      800 |
| S2  | P5  | J2  |      100 |
| S3  | P3  | J1  |      200 |
| S3  | P4  | J2  |      500 |
| S4  | P6  | J3  |      300 |
| S4  | P6  | J7  |      300 |
| S5  | P1  | J4  |      100 |
| S5  | P2  | J2  |      200 |
| S5  | P2  | J4  |      100 |
| S5  | P3  | J4  |      200 |
| S5  | P4  | J4  |      800 |
| S5  | P5  | J4  |      400 |
| S5  | P5  | J5  |      500 |
| S5  | P5  | J7  |      100 |
| S5  | P6  | J2  |      200 |
| S5  | P6  | J4  |      500 |
+-----+-----+-----+----------+
*/

/*
Expected result: just P3.

+-----+
| PID |
+-----+
| P3  |
+-----+
*/
-- [common INCORRECT approach!!]
-- The WHERE cause evaluates for every row from the FROM clause.
-- For each row any attribute must be single valued.
-- Therefore the WHERE clause expression will evaluate fo False for every row,
-- and the result set will therefore always be empty.
SELECT DISTINCT PID
FROM SUPPLY
WHERE SID = 'S2' AND SID = 'S3';
/*
(Empty set)
*/



-- CORRECT approach 1 (self join with IN):
SELECT DISTINCT SPX.PID
FROM SUPPLY SPX
WHERE SID = 'S2' AND PID IN (SELECT PID
                             FROM SUPPLY SPY
                             WHERE SPY.SID = 'S3');

-- CORRECT approach 2 (self join without a subquery):
SELECT DISTINCT SPX.PID
FROM SUPPLY SPX, SUPPLY
WHERE SPX.PID = SPY.PID
AND SPX.SID = 'S2'
AND SPY.SID = 'S3';

-- CORRECT approach 3 (INTERSECT):
SELECT DISTINCT PID
FROM SUPPLY
WHERE SID = 'S2'
INTERSECT
SELECT DISTINCT PID
FROM SUPPLY
WHERE SID = 'S3';

The DIVIDE BY operator in SQL #

-- Find supplier names for suppliers who supply all the parts.

SELECT * FROM SUPPLIER;
/*
+-----+-------+--------+--------+
| SID | SNAME | STATUS | CITY   |
+-----+-------+--------+--------+
| S1  | SMITH |     20 | LONDON |
| S2  | JONES |     10 | PARIS  |
| S3  | BLAKE |     30 | PARIS  |
| S4  | CLARK |     20 | LONDON |
| S5  | ADAMS |     30 | ATHENS |
+-----+-------+--------+--------+
*/

SELECT * FROM PART;
/*
+-----+-------+-------+--------+--------+
| PID | PNAME | COLOR | WEIGHT | CITY   |
+-----+-------+-------+--------+--------+
| P1  | NUT   | RED   |     12 | LONDON |
| P2  | BOLT  | GREEN |     17 | PARIS  |
| P3  | SCREW | BLUE  |     17 | ROME   |
| P4  | SCREW | RED   |     14 | LONDON |
| P5  | CAM   | BLUE  |     12 | PARIS  |
| P6  | COG   | RED   |     19 | LONDON |
+-----+-------+-------+--------+--------+
*/

SELECT * FROM SUPPLY;
/*
+-----+-----+-----+----------+
| SID | PID | JID | QUANTITY |
+-----+-----+-----+----------+
| S1  | P1  | J1  |      200 |
| S1  | P1  | J4  |      700 |
| S2  | P3  | J1  |      400 |
| S2  | P3  | J2  |      200 |
| S2  | P3  | J3  |      200 |
| S2  | P3  | J4  |      500 |
| S2  | P3  | J5  |      600 |
| S2  | P3  | J6  |      400 |
| S2  | P3  | J7  |      800 |
| S2  | P5  | J2  |      100 |
| S3  | P3  | J1  |      200 |
| S3  | P4  | J2  |      500 |
| S4  | P6  | J3  |      300 |
| S4  | P6  | J7  |      300 |
| S5  | P1  | J4  |      100 |
| S5  | P2  | J2  |      200 |
| S5  | P2  | J4  |      100 |
| S5  | P3  | J4  |      200 |
| S5  | P4  | J4  |      800 |
| S5  | P5  | J4  |      400 |
| S5  | P5  | J5  |      500 |
| S5  | P5  | J7  |      100 |
| S5  | P6  | J2  |      200 |
| S5  | P6  | J4  |      500 |
+-----+-----+-----+----------+
*/

/*
Expected result:

+-------+
| SNAME |
+-------+
| ADAMS |
+-------+
*/
-- CORRECT approach:
SELECT S.SNAME
FROM SUPPLIER S
WHERE NOT EXISTS (SELECT *
                  FROM PART P
                  WHERE NOT EXISTS(SELECT *
                                  FROM SUPPLY SPJ
                                  WHERE SPJ.SID = S.SID
                                  AND SPJ.PID = P.PID));

-- When we use GROUP BY HAVING clauses, we will see an alternative approach to representing above.
-- Many people find this alternative approach is more intuitive.

Functions #

Five basic aggregate functions are:

  1. COUNT
  2. MAX
  3. MIN
  4. SUM
  5. AVG
-- Find the total quantity of red parts supplied.

SELECT SUM(SPJ.QUANTITY)
FROM SUPPLY SPJ
WHERE SPJ.PID IN (SELECT P.PID
                  FROM PART P
                  WHERE P.COLOR = 'RED');
/*
+-------------------+
| SUM(SPJ.QUANTITY) |
+-------------------+
|              3600 |
+-------------------+
*/

Operators #

In addition to the five aggregate functions, a number of other standard functions and operators exist. These include

  • arithmetic operators: + - * /
  • a string concatenation operator: ||

Individual relational DBMS products typically support a wide range of non-standard operators.

-- Find the supplier, part and project ids, and total weight of the parts supplied
-- where the total weight is the maximum for any supply instance.

SELECT SPJ.SID, SPJ.PID, SPJ.JID, P.WEIGHT * SPJ.QUANTITY AS "TOTAL WEIGHT"
FROM SUPPLY SPJ, PART P
WHERE SPJ.PID = P.PID
AND P.WEIGHT * SPJ.QUANTITY = (SELECT MAX(PX.WEIGHT * SPJX.QUANTITY)
                               FROM SUPPLY SPJX, PART PX
                               WHERE SPJX.PID = PX.PID);
/*
+-----+-----+-----+--------------+
| SID | PID | JID | TOTAL WEIGHT |
+-----+-----+-----+--------------+
| S2  | P3  | J7  |        13600 |
+-----+-----+-----+--------------+
*/

Week 5: SQL GROUP BY queries, NULL values #

The GROUP BY and HAVING clauses #

-- For each part supplied, find the part id and the total quantity supplied of that part.

SELECT PID, SUM(QUANTITY)
FROM SUPPLY
GROUP BY PID;

/*
+-----+---------------+
| PID | SUM(QUANTITY) |
+-----+---------------+
| P1  |          1000 |
| P2  |           300 |
| P3  |          3500 |
| P4  |          1300 |
| P5  |          1100 |
| P6  |          1300 |
+-----+---------------+
*/



-- Find supplier numbers for suppliers who supply more than two parts from London.

SELECT SPJ.SID
FROM SUPPLY SPJ, PART P
WHERE SPJ.PID = P.PID
AND P.CITY = 'LONDON'
GROUP BY SPJ.SID
HAVING COUNT(DISTINCT SPJ.PID) > 2;
/*
+-----+
| SID |
+-----+
| S5  |
+-----+
*/
-- Find supplier names for suppliers whi supply all the parts.

-- We used this query in the previous week.
SELECT S.SNAME
FROM SUPPLIER S
WHERE NOT EXISTS (SELECT *
                  FROM PART P
                  WHERE NOT EXISTS(SELECT *
                                  FROM SUPPLY SPJ
                                  WHERE SPJ.SID = S.SID
                                  AND SPJ.PID = P.PID));

-- Now, we can rewrite to
SELECT S.SNAME
FROM SUPPLIER S, SUPPLY SPJ
WHERE S.SID = SPJ.SID
GROUP BY SPJ.SID, S.SNAME
HAVING COUNT(DISTINCT SPJ.PID) = (SELECT COUNT(*) FROM PART);

/*
GROUP BY and HAVING are used to answer whether the number of
different parts a supplier supplies is the same as the total number of different parts.

Why are we grouping on both the supplier id and the supplier name?
The reason for that is that the query's asking for the names of suppliers.
Now, if we were to group just on the names of the suppliers,
we won't be handling the scenario where the more than supplier has the same name.

You'll recall that any attributes in the select statement must be single-valued for each partition,
which means it must be one of the GROUP BY attributes or an aggregate function.
If we don't add SNAME to our GROUP BY attributes,
then SNAME in the select clause is not going to be single-valued,
and it will actually give you an error.

You will see the error "'S.SNAME' isn't in GROUP BY".

While you need to add SNAME to GROUP BY, does adding SNAME here affect the partitioning?
The answer is NO.
Because SID is a unique identifier for every supplier,
there can only ever be one SNAME for every supplier number.
Therefore, adding extra attributes from the same table to the GROUP BY
will not affect the grouping or the partitioning.
*/

UNION operator #

-- Find the cities which have a supplier or project associated with them.

SELECT CITY
FROM SUPPLIER
UNION
SELECT CITY
FROm PROJECT;
/*
+--------+
| CITY   |
+--------+
| LONDON |
| PARIS  |
| ATHENS |
| ROME   |
| OSLO   |
+--------+
*/

-- UNION operator omit duplicates by default. If you need them, use UNION ALL.

SELECT CITY
FROM SUPPLIER
UNION ALL
SELECT CITY
FROM PROJECT;
/*
+--------+
| CITY   |
+--------+
| LONDON |
| PARIS  |
| PARIS  |
| LONDON |
| ATHENS |
| PARIS  |
| ROME   |
| ATHENS |
| ATHENS |
| LONDON |
| OSLO   |
| LONDON |
+--------+
*/

Handling NULL values in SQL: Three-valued logic #

The basic rules for evaluating arithmetic expressions or aggregate functions involving null values are:

  1. Any arithmetic expression involving a null value itself evaluates to a null value.
  2. Hence, the following expressions all evaluate to a null value if A (in the first two cases) or either A or B (in the remaining cases) is a null value:
    • A, -A, A + B, A - B, A * B, A / B
  3. Aggregate function(COUNT, SUM, AVG, MAX, MIN) are evaluated with any null values simply being ignored except in the case of COUNT(*) where null values are treated as ordinary values.
  4. If the argument of an aggregate function is an empty set, COUNT returns zero, while all other aggregate functions return a null value.

In the presence of NULL values, predicates do not necessarily evaluate to true or false. This is where we introduce the concept of a three-valued logic where a third truth value of ‘unknown’ is needed. We have three possible truth values: ’true’, ‘false’, or ‘unknown’.

  • Any comparison (= <> < <= > >=) involving NULL values evaluates to ‘unknown’. Hence, if either A or B is a null value, A = B is ‘unknown’.
  • X IN (nested query) evaluates to
    • ’true’ if the expression X = Y evaluates to ’true’ for at least one Y in the result of the nested query
    • ‘false’ if the expression X = Y evaluates to ‘false’ for every Y in the result of the nested query, or if the nested query result is empty.
    • ‘unknown’ otherwise.
  • As SQL is defined in the context of a 3 value logic, the result of the Boolean operators (AND, OR, NOT) must also be specified for the three values.
+---------+---------+-------+---------+
| [AND]   | true    | false | unknown |
+---------+---------+-------+---------+
| true    | true    | false | unknown |
| false   | false   | false | false   |
| unknown | unknown | false | unknown |
+---------+---------+-------+---------+

+---------+------+---------+---------+
| [OR]    | true | false   | unknown |
+---------+------+---------+---------+
| true    | true | true    | true    |
| false   | true | false   | unknown |
| unknown | true | unknown | unknown |
+---------+------+---------+---------+

+---------+---------+
|         | [NOT]   |
+---------+---------+
| true    | false   |
| false   | true    |
| unknown | unknown |
+---------+---------+
  • The definition EXISTS is unaffected by null values
    • EXISTS return ’true’ if the nested query result is non-empty, and ‘false’ otherwise.
  • The basic semantics of a WHERE clause are unaffected by null values
    • only those rows for which the WHERE clause evaluates to ’true’ figure in the resulting table. If the WHERE clause evaluates to ‘unknown’ for a row, it is omitted from the result.

Table and Join expressions #

We looked at how we can join our tables together using a Cartesian product by specifying the tables in the FROM clause, followed by relational algebra selection by specifying the join conditions in the WHERE clause. We also looked at how you can achieve joins between tables using IN subqueries and EXISTS subqueries.

/*
We have only seen examples of explicit table references in FROM clauses so far.
However, the full SQL standard allows the use of more general expressions
to specify a table in a FROM clause.

We can consider this subquery in our FROM clause as forming a temporary table in the query processor.
We're giving that temporary table an alias P2_S.
*/
SELECT DISTINCT S.SNAME
FROM SUPPLIER S,
  (SELECT DISTINCT SID
   FROM SUPPLY
   WHERE PID = 'P2') P2_S
WHERE P2_S.SID = S.SID;
/*
JID attribute is not present in this conceptual table.
*/
SELECT DISTINCT S.SNAME, P2_S.JID -- 👈 So, unfortunately, this doesn't work and raises an error.
FROM SUPPLIER S,
  (SELECT DISTINCT SID
   FROM SUPPLY
   WHERE PID = 'P2') P2_S
WHERE P2_S.SID = S.SID;

/*
You should:
*/
SELECT DISTINCT S.SNAME, P2_S.JID
FROM SUPPLIER S,
  (SELECT DISTINCT SID, JID -- 👈 You need this. Now it works.
   FROM SUPPLY
   WHERE PID = 'P2') P2_S
WHERE P2_S.SID = S.SID;

Now, we’re going to introduce the JOIN operators. By using the JOIN operators in SQL, we can also handle NULL values when we join tables together.

-- By default, when we just specify join, the SQL language defaults to an inner join.

-- case1
SELECT *
FROM SUPPLIER JOIN SUPPLY USING (SID);

-- case2
SELECT *
FROM SUPPLIER S JOIN SUPPLY SPJ ON S.SID = SPJ.SID;

/*
In case1, we will just have one SID column.
In case2, we sill have two SID column (from left and right tables).

case1:
+-----+-------+--------+--------+-----+-----+----------+
| SID | SNAME | STATUS | CITY   | PID | JID | QUANTITY |
+-----+-------+--------+--------+-----+-----+----------+
| S1  | SMITH |     20 | LONDON | P1  | J1  |      200 |
| S1  | SMITH |     20 | LONDON | P1  | J4  |      700 |
| S2  | JONES |     10 | PARIS  | P3  | J1  |      400 |
| S2  | JONES |     10 | PARIS  | P3  | J2  |      200 |
| ...                                                  |
+-----+-------+--------+--------+-----+-----+----------+

case2:
+-----+-------+--------+--------+-----+-----+-----+----------+
| SID | SNAME | STATUS | CITY   | SID | PID | JID | QUANTITY |
+-----+-------+--------+--------+-----+-----+-----+----------+
| S1  | SMITH |     20 | LONDON | S1  | P1  | J1  |      200 |
| S1  | SMITH |     20 | LONDON | S1  | P1  | J4  |      700 |
| S2  | JONES |     10 | PARIS  | S2  | P3  | J1  |      400 |
| S2  | JONES |     10 | PARIS  | S2  | P3  | J2  |      200 |
| ...                                                        |
+-----+-------+--------+--------+-----+-----+-----+----------+
*/