Databases

[mathjax]

Principles of Information and Data Management with Tomasz Imielinski

Description

Describing and querying various forms of information such as structured data in relational databases, unstructured text (IR), semi-structured data (XML, web), deductive knowledge. Conceptual modeling and schema design. Basics of database management system services (transactions, reliability, security, optimization). Advanced topics: finding patterns in data, information mapping and integration. The course focuses on a user’s perspective, rather than how one implements DBMS.

  • Credits: 49
  • Prerequisites:
    • 01:198:112; 01:198:205 or 14:332:312.

Please note that courses for which a student has received a grade of D cannot be used to satisfy prerequisite requirements.

  • Semesters Offered:
    • Spring, summer and fall
  • Expected Work:
    • Homework and programming assignments; project
  • Exams:
    • Midterm exam
    • Final exam

Department Learning Goals:

  • Computer Science majors …
    • will be prepared to contribute to a rapidly changing field by acquiring a thorough grounding in the core principles and foundations of computer science (e.g., techniques of program design, creation, and testing; key aspects of computer hardware; algorithmic principles).
    • will acquire a deeper understanding on (elective) topics of more specialized interest, and be able to critically review, assess, and communicate current developments in the field.
    • will be prepared for the next step in their careers, for example, by having done a research project (for those headed to graduate school), a programming project (for those going into the software industry), or some sort of business plan (for those going into startups).

September 4th, 2013 Reading, SQL, pages 243-273

  • SQL stands for “structured query language.”

Simple Queries in SQL

  • The simplest form of query in SQL asks for those tuples of some one relation that satisfy a condition.
  • SELECT,FROM,and WHERE characterize SQL.
  • Here’s an example scheme:
    Movies(title, year, length, genre, studioName, producerC#)
    StarsIn(movieTitle, movieYear, starName) 
    MovieStar(name,address, gender, birthdate) 
    MovieExec(name,address, cert#, netWorth)
    Studio(name, address, presC#)
    
  • To get all movies by Disney, for instance, you’d query:
    SELECT *    // get all
    FROM Movies // the movies 
    WHERE studioName = ’Disney’ AND year = 1990; // from Disney in 1990 
    
  • Here’s what each of the clauses do:
    1. The SELECT clause tells which attributes of the tuples matching the condition are produced as part of the answer.
      • The * in our case gets all the tuples.
    2. The FROM clause gives the relation or relations to which the query refers.
      • Getting “Movies” means you get all the movies …
    3. The WHERE clause is a condition, much like a selection-condition in rela­tional algebra.
      • This uses predicate logic.
  • A trick for reading and writing queries, read them in the following order:
    1. the FROM clause, to learn which relations are involved;
    2. the WHERE clause, to learn what about the tuples is important
    3. the SELECT clause to see what the output is

Projection in SQL

  • If you only wanted to see qualities about the movie in the example above instead of all of the information, that would be in the SELECT clause, you could change it to just title, length, for instance.
  • If you want to change the title header for a column, you can do so with the AS keyword,
    SELECT title AS name
    
  • If you want to change the value of all the elements in a column, you can add expressions to SELECTs:
    SELECT length*0.016667 AS lengthlnHours
    
  • SQL is case-insensitive.

Selection in SQL

  • Six comparison operators:
    =, <>, <, >, <=, and >=
    
    • <> is the SQL symbol for “not equal to”
  • The concatenation operator in SQL is ||:
    ’foo’ || ’bar’ = ’foobar’
    

Comparison of Strings

  • When you compare strings with the operators, you are asking about lexicographic values.

Pattern Matching in SQL

  • To use patterns, you can add a LIKE statement to the WHERE clause. Example:
    WHERE title LIKE 'Star ____'
    
  • Returns Star Wars, Star Gate, Star Trek, …

Dates and Times

  • SQL implementations are very specific about dates.
  • The format described here is the SQL standard.
  • DATE '1948-05-14' follows the required form.
  • There’s also time, TIME '15:00:02.5' or TIME '12:00:00-8:00'
  • You can combine the two with timestamps: TIMESTAMP '1948-05-14 12:00:00'

Null Values and Comparisons Involving NULL

  • There are three reasons that SQL will return a NULL:
    1. Value unknown: that is, “I know there is some value that belongs here but I don’t know what it is.”
      • There’s an answer in principle but this database doesn’t have one in practice.
    2. Value inapplicable: “There is no value that makes sense here.”
      • It’s possible to have an answer, it just happens that there isn’t here.
    3. Value withheld: “We are not entitled to know the value that belongs here.”
  • If you operate on NULL, the return value will always be NULL.
  • If you compare on NULL, the return value will always be UNKNOWN.
    • This is a SQL truth value.
  • Interestingly enough, if you place NULL in a variable name, you can get these legal comparisons and operations. However, if you do the operation on NULL directly, it is not valid SQL.

The Truth-Value UNKNOWN

  1. The AND of two truth-values is the minimum of those values. That is, x AND y is FALSE if either x or y is FALSE; it is UNKNOWN if neither is FALSE but at least one is UNKNOWN, and it is TRUE only when both x and y are TRUE.
  2. The OR of two truth-values is the maximum of those values. That is, x OR y is TRUE if either x or y is TRUE; it is UNKNOWN if neither is TRUE but at least one is UNKNOWN, and it is FALSE only when both are FALSE.
  3. The negation of truth-value v is 1 —v. That is, NOT x has the value TRUE when x is FALSE, the value FALSE when x is TRUE, and the value UNKNOWN when x has value UNKNOWN.

Ordering the Output

  • You can add a fourth clause the the SELECT-FROM-WHERE semantic to order the output.
    • Aptly named, ORDER BY and it accepts a list of attributes.
    • The default is ascending, and you can make it descending by adding the keyword DESC.
  • To order movies by length and then alphabetically from the running example, you’d use this clause:
    ORDER BY length, title;
    

Queries Involving More Than One Relation

  • The set-theoretic operations — union, intersection, and difference — appear directly in SQL.
  • It seems as though the names of table headers must be unique, as if you wanted any information from multiple columns, you can do this:
    WHERE title = ’Star Wars’ AND producerC# = cert#;
    
    • This returns the producers of Star Wars, whatever you select from it.

Disambiguating Attributes

  • Ah! No, you do not have to have them unique across tables, and the book covers this next.
  • Imagine this:
    MovieStar(name,address, gender, birthdate) 
    MovieExec(name, address, cert#, netWorth)
    
    • The way that you disambiguate them is with the following semantic:
      WHERE MovieStar.address = MovieExec.address;
      
    • The relation name, followed by a dot, is permissible even in situations where there is no ambiguity.

Tuple Variables

  • You may want to select from the same table twice, and you can do this with aliases.
    FROM MovieStar Starl, MovieStar Star2
    

Subqueries

September 5th, 2013 Introduction to SQL

Why SQL?

  • SQL is a very-high-level language.
    • Say “what to do” rather than “how to do it.”
    • Avoid a lot of data-manipulation details needed in procedural languages like C++ or Java.
      • In systems like MangoDB, you do a lot of data manipulation yourself.
      • The name of the game now is scalabilty and millions and millions of users.
      • SQL scales very well with ad-hoc, indeterminate number of queries.
    • Why not? Database management systems figures out best way to executre queiry.
      • Called query manipulation.

Select-From-Where Statements

  • SELECT desired attributes.
  • FROM one or more tables.
  • WHERE condition about tuples of the tables.

Our Running Examples

  • All our SQL queries will be based on the following database schema.
    Beers(name, manf) 
    Bars(name, addr, license)
    Drinkers(name, addr, phone) 
    Likes(drinker, beer) 
    Sells(bar, beer, price) 
    Frequents(drinker, bar)
    
  • From this you can find out who sells Heineken, you can find people who only frequent bars that sell beers they like.
    • This is the “heart of the matter.”
  • Ad-hoc queries are unpredictable, and this is where SQL shines.
    • You can find any objects in the database, you don’t have to write code. … Very difficult, you write it, and a week later to think, “What the hell did I do?”

Example What beers are made by Anheuser-Busch

  • Query:

    SELECT name FROM Beers WHERE manf = ’Anheuser-Busch’;

  • Result:

    Bud Bud Lite Michelob …

Meaning of Single-Relation Query

  • Begin with the relation in the FROM clause.
  • Apply the selection indicated by the WHERE clause.
  • Apply the extended projection indicated by the SELECT clause.

NULL Values in SQL

  • The logic of conditions in SQL is really 3-valued logic: TRUE, FALSE, UNKNOWN.
  • Comparing any value (including NULL itself) with NULL yields UNKNOWN.
  • A tuple is in a query answer if and only if the WHERE clause is TRUE (not FALSE or UNKNOWN).

September 17th, 2013 Keywords for Quiz 1

  • JOIN: used to combine rows from two or more tables, based on a common field between them.
    SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate 
    FROM Orders
    INNER JOIN Customers
    ON Orders.CustomerID=Customers.CustomerID;
    
  • UNION: combines the result of two or more SELECT statements.
    SELECT column_name(s) FROM table1
    UNION
    SELECT column_name(s) FROM table2;
    
  • INTERSECT: allows you to return the set theoretic intersection of the results of 2 or more “select” queries
    select field1, field2, . field_n
    from tables
    INTERSECT
    select field1, field2, . field_n
    from tables;
    
  • MINUS: returns all rows in the first SQL SELECT statement that are not returned in the second SQL SELECT statement
    select field1, field2, ... field_n
    from tables
    MINUS
    select field1, field2, ... field_n
    from tables;
    
  • EXISTS: this is a “to be met condition,” where the select statement of the super query operates on the returned table.
    SELECT *
    FROM suppliers
    WHERE EXISTS (  select *
                    from orders
                    where suppliers.supplier_id = orders.supplier_id);
    
  • NOT EXISTS: the NOT operator can be combined with the EXISTS operator to operate on a subquery does not satisfy it’s where condition.
  • ALL: used as a union that allows duplicates
    SELECT column_name(s) FROM table_name1
    UNION ALL
    SELECT column_name(s) FROM table_name2
    
  • ORDER BY (DESC): When sorting your result set in descending order, you use the DESC attribute in your ORDER BY claus
    SELECT supplier_city
    FROM suppliers
    WHERE supplier_name = 'IBM'
    ORDER BY supplier_city DESC;
    
  • COUNT: returns the number of rows in a query
    SELECT COUNT(expression)
    FROM tables
    WHERE predicates;
    

September 24th, 2013 Chapter 2, Relational Model

An Overview of Data Models

  • A data model is a notation for describing data. It has three parts:
    1. Structure of the data.
    2. Operations on the data.
    3. Constraints on the data.
  • There are two preeminent database systems:
    1. The relation model, include object-relational extensions.
    2. The semistructured-data model, including XML and related standards.
      • In brief, this is a way to represent data by hierarchically nested tags.
      • Often talked about as a tree with parents and children nodes.

Basics of the Relational Model

Attributes

  • The columns of a relation are named attributes.
    • For example, title, year, length, and genre if the the relation was Movies.

Schemas

  • The name of a relation and the set of attributes for a relation is called the schema for that relation.
    • In the case of movies,
      Movies(title, year, length, genre)
      
  • Attributes are a set, not a list.

Tuples

  • The rows of a relation, other than the header row, are called tuples.
    • A tuple has one component for each attribute of the relation.

Domains

  • The relation model requires that each component of each tuple be atomic;
    • It must be of some elementary type (integer, string, etc.)
    • It is not permitted for it to be a record structure, set, list, array, or anything that can be broken into smaller elements.
  • It is further assumed that associated with each attribute of a relation is a domain, a particular elementary type.
    Movies(title:string, year:integer, length:integer, genre:string)
    

Keys of Relations

  • A set of attributes forms a key for a relation if we do not allow two tuples in a relation instance to have the same values in all attributes of the key.

Defining a Relation Schema in SQL

  • SQL is the principal language used to describe and manipulate relational databases. There are two parts:
    1. The Data-Definition sublanguage for declaring schemas;
    2. The Data-Manipulation sublanguage for querying and modifying the database.

Relations in SQL

  • SQL makes a distinction between three kinds of relations:
    1. Stored relations, called tables.
      • These are the kind we deal with typically.
      • Can be modified by changing it’s tuples and queried.
    2. Views, which are relations defined by a computation.
      • These relations are not stored, but are constructed.
    3. Temporary tables, which are constructed by the SQL language processor when it performs its job of executing queries and data modifications.
      • These relations are not stored.

Data Types

  1. Character string of fixed or varying length.
  2. Bit string of fixed or varying length.
  3. The type BOOLEAN, TRUE, FALSE, and UNKNOWN.
  4. The type INT or INTEGER.
  5. Floating point numbers, FLOAT or REAL.
  6. Dates and times can be represented with the types DATE and TIME.

Simple Table Declarations

CREATE TABLE Movies (
    title       CHAR(IOO), 
    year        INT, 
    length      INT,
    genre       CHAR(10), 
    studioName      CHAR(30), 
    producerC#      INT
);

An Algebraic Query Language Set Operations on Relations

  • There are four broad classes of operations:
    1. The usual set operations;
      1. The union of R and S, is the set of elements that are in R or S or both. Elements appear only once despite being in multiple times;

        R ∪ S

      2. The intersection of R and S, is the set of elements that are in both R and S;

        R ∩ S

      3. The difference of R and S, is the set of elements that are in R but not in S.

        R − S

    2. Operations that remove parts of a relations;
      1. Selection, produces a new relation with a subset of R’s tuples, filter tuples, where C is a condition:

        σCR

      2. Projection is used to produce from a relation R a new relation that has only some of R’s columns, filter attributes:

        πA1, A2, …, AnR

    3. Operations that remove parts of the relation;
      1. Cartesian product, combines relations in all possible ways,
      2. Joins, selectively pair tuples from two relations,
    4. Renaming, which changes the attributes and/or name of relation.
  • These are queries.
  • We want to use these operations for queries, but there are these constraints:
    • R and S must have schemas with identical sets of attributes.
    • Before we compute the set-theoretic union, intersection, or difference of sets of tuples, the columns of R and S must be ordered so that the order of attributes is the same for both relations.

September 24th, 2013 Project 1

Your objective will be to extend significantly the famous db scheme with new attributes and new relations. It has to be realistic and interesting. Add info about facebook friends of drinkers? Offer information about bar’s annual sales? Provide more geo-location data about drinkers and bars? Temporal data? Sky is the limit –). Once you settle of the db scheme, you will have to populate it with realistic tuples. By realistic, I mean names of bars, drinker names, dollar figures etc when appropriate. No a1, b1, c1! No drinker X and drinker Y! Generate and load your db with the large number of synthetic tuples, may be 10,000? May be more. It is your choice. But your instances should not be completely random. On the contrary – you should think of a few patterns which will have business significance. For example, the larger the bar, the less it charges for a beer (most of the time). The older the drinker, the less frequently s/he goes to a bar. Evening beer sales are higher than afternoon beer sales. Embed a number of such patterns in your data and construct SQL queries which would prove that indeed your db satisfies such pattern. Your patterns should be intuitive and have actionable business value either for a bar o for a drinker.

Thus, there following tasks:

  1. Scheme definition
  2. Realistic db instance generation and db loading + Pattern embedding in the instance as well as validation of the pattern using sql query
  3. Simple DB GUI with search boxes which will find something interesting about your data.

October 1st, 2013 Reading: pg. 125-171

The Entity/Relationship Model

  • In the E/R model, the structure of data is represented graphically, as an “ER diagram”, using three principal element types:
    1. Entity sets,
    2. Attributes, and
    3. Relationships.

Entity Sets

  • An entity is an abstract object of some sort, a collection of similar entities is an entity set.
    • An entity in some ways represents an “object” in the sense of OO-programming.
    • Likewise, an entity set resembles a class of objects.
  • In the movie example, each movie is an entity, and the set of all movies constitute an entity set.

Attributes

  • Entity sets have associated attributes, which are properties of the entities in that set.
    • The entity set Movies might be given attributes such as title and length.

Relationships

  • Relationships are connections among two or more entity sets.
    • For instance, if Movies and Stars are two entity sets, we could have relationship Stars-in that connects movies and stars.

Entity-Relationship Diagrams

  • Entity sets are represented by rectangles.
  • Attributes are represented by ovals.
  • Relationships are represented by diamonds.
    • “Is-a relationships” are special and represented with triangles.
  • Edges connect an entity set to its attributes and also connect a relationship to its entity set.

Multiplicity of Binary E/R Relationships

  • If each member of E can be connected by R to at most member of F, then we say that R is a many-one from E to F.
    • Note that in a many-one relationship from E to F, each entity in F can be connected to many members of E.
  • If R is both many-one from E to F and many-one from F to E, then we say that R is one-one.
    • In a one-one relationship an entity of either entity set can be connected to at most one entity of the other set.
  • If R is neither many-one from E to F or from F to E, then we say R is many-many.

Multiway Relationships

  • The E/R model makes it convenient to define relationships involving more than two entity sets. In practice, ternary or higher degree relationships are rare, but occasionally are necessary to represent the true state of affairs.
    • They are represented with lines between all involved entity sets.
  • An arrow pointing to an entity set E means that if we select one entity from each of the other entity sets in the relationships, those entities are related to at most entity in E.

Roles in Relationships

  • It is possible that one entity sets appears two or more times in a single relationship.
    • If so, we draw as many lines from the relationship to the entity sets as the entity set appears in the relationship.
    • Each line to the entity set represents a different role that the entity set plays in that relationship.

Design Principles

Faithfulness

  • First and foremost, the design should be faithful to the specs.
    • The entity sets should reflect reality.

Avoid Redundancy

  • We should be careful to say everything once only.
  • This is dangerous because:
    1. Doing so leads to repetition of fact, with the result that the extra space is required to represent the data.
    2. There is an update-anomoly potential, since we might change the relationship but not the attribute.

Simplicity Counts

  • Avoid introducing more elements into your design than is absolutely necessary.

Choosing the Right Relationships

  • Entity sets can be connected in various ways by relationships.
    • However, adding

Constraints in the E/R model

Keys in the E/R Model

  • A key for an entity set E is a set K of one or attributes such that, given any two distinct entities e1 and e2, e1 and e2 cannot have identical values for each of the attributes in the key K.
    • Every entity set must have a key, although in some cases, isa-hierarchies and “weak” entity sets, the key actually belongs to another entity set.
    • There can be more than one possible key for an entity set. However, it is customary to pick one key as the “primary key,” and to act as if that were the only key.
    • When an entity set is involved in an isa-hierarchy, we require that the root entity set have all the attributes needed for a key, and that the key for each endtity is found from its component in the root entity set, regardless of how many entity sets in the hierarchy have components for the entity.

Representing Keys in the E/R Model

  • In our E/R-diagram notation, we underline the attributes belonging to a key for an entity set.

Referential Integrity

  • The arrow notation in E/R diagrams is able to indicate whether a relationship is expected to support referential integrity in one or more directions.
    • Suppose R is a relationship from entity set E to entity set F. A rounded arrow-head poring to F indicates not only that the relationship is many-one from E to F, but that the entity of set F related to a given entity of set E is required to exist.
      • The same idea applies when R is a relationship among more than two entity sets.

Degree Constraints

  • In the E/R-model, we can attach a bounding number to the edges that connect a relationship to en entity set, indicating limits on the number of entities that can be connected to any one entity of the related entity set.

Weak Entity Sets

  • It is possible for an entity set’s key to be composed of attributes, some or all of which belong to another entity set.
    • Such an entity set is called weak entity set.

Causes of Weak Entity Sets

  • There are two principle reasons we need weak entity sets.
    1. Sometimes entity sets fall into a hierarchy based on classifications unrelated to “is-a hierarchy.”
    2. Connecting entity sets that as a way to eliminate a multiway relationship.

Requirements for Weak Entity Sets

  • If E is a weak entity set then its key consists of:
    1. Zero or more of its own attributes.
    2. Key attributes from entity sets that are reach by certain many-one relationships from E to other entity sets.
      • These many-one relationsips are called **supporting relationships** for E, and the entity sets reach from E are supporting entity sets.
  • In order for R, a many-one relationship from E to some entity set F, to be a supporting relationship for E, the following conditions must be obeyed:
    1. R must be binary, many-one relationship from E to F.
    2. R must have referential integrity from E to F.
      • For every E-entity, there must be exactly one existing F-entity related to it by R.
      • Put another way, a rounded arrow from R to F must be justified.
    3. The attributes that F supplies for the key of E must be key attributes of F.
    4. However, if F is itself weak, then some or all of the key attributes of one or more entity sets G to which F is connected by a supporting relationship.
      • Recursively, if F is weak, some key attributes of F will be supplied from elsewhere, and so on.

Weak Entity Set Notation

  • We shall adopt the following conventions to indicate that an entity set is weak and to declare its key attributes:
    1. If an entity set is weak, it will be shown as a rectangle weit a double border.
    2. Its supporting many-one relationships will be shown as diamonds with a double border.
    3. If an entity set supplies any attributes for its own key, the those attributes will be underlines.

From E/R Diagrams to Relational Designs

  • To a first approximations, converting an E/R design to a relational database schema is straightforward.
    • Turn each entity set into a relation with the same set of attributes.
    • Replace a relationship by a relation whose attributes are the keys for the connected entity sets.
  • While these etwo rules cover much of the ground, there are apse special situations that we need to deal with, including:
    1. Weak entity sets cannot be translated straightforwardly to relations.
    2. “Isa” relationship and subclasses require careful treatment.
    3. Sometimes, we do well to combine two relations, especially the relation for an entity set E and the relation that comes from E to some other entity set.

Summary

  • The Entity-Relationship Model: In the E/R model we describe entity sets, relationships among entity sets, and attributes of entity sets and relationships. Members of entity sets are called entities.
  • Entity-Relationship Diagrams: We use rectangles, diamonds, and ovals to draw entity sets, relationships, and attributes, respectively.
  • Multiplicity of Relationships: Binary relationships can be one- one, many-one, or many-many.
    • In an one-one relationship, an entity of either set can be associated with at most one entity in the other set.
    • In a many-one relationship, each entity in the “many” side is associated with at most one entity on the other side.
    • In a many-many relationship, there is no restriction.
  • Good design: Designing databased effectively requires that we represent the real world faithfully, that we select appropriate elements, and that we avoid redundancy.

October 1st, 2013 6.4 Full-Relation Operations

Eliminating Duplicates

  • SQL’s notion of relations differs from the abstract notion.
    • A relation, being a set, cannot have more than one copy of a given tuple.
    • When a SQL query creates a new relation, the SQL system does not ordinarily eliminate duplicates.
      • This, the SQL response to a query may list the same tuple several times.
  • One of the several equivalent definitions of the meaning of a SQL select-from-where query is that we begin with the Cartesian product of the relations referred to in the FROM clause.
    • Each tuple of the product is tested by the condition in the WHERE clause.
    • The ones that pass the test are given to the output for projection according to the SELECT clause.
  • If we do not wish duplicates in the result, then we may follow the keyword SELECT by the keyword DISTINCT.
    • This word tells SQL to produce only one copy of any tuple.

Aggregation Operators

  • SQL uses five aggregative operators SUM, AVG, MIN, MAX, and COUNT.
    • SUM adds up up the field it’s given.
    • AVG gets the average of the field given.
    • MIN finds the smallest
    • COUNT gets the number of fields

Duplicates in Unions, Intersections, and Differences

  • Unlike the SELECT statement, which preserves duplicates as a default and only eliminates them when instructed to by the DISTINCT keyword, the union, intersection, and difference operations normally eliminate duplicates.
    • If you want to include duplicates, you can do so by including the ALL keyword.

Grouping

  • To group tuples, we use a GROUP BY clause following the WHERE clause.

Grouping, Aggregating, and Nulls

  • The value NULL is ignored in any aggregation.
    • It does not contribute to sum, average, or count of an attribute.
    • Nor can it be the minimum or maximum of a column.
  • NULL is treated as an ordinary value when forming groups.
    • That is, we can have a group in which one or more of the grouping attribute are assigned the value NULL.
  • When we perform any aggregation exception count over an empty bag, the result is NULL. The count of an empty bag is 0.

October 8th, 2013 Study

bars (name, addr, license, phone)
likes (drinker, beer)
frequents (drinker, bar)
sells (bar, beer, price)
beer (manf, name)

Questions 1

  • From Beers (name, manf), find those beers that are the only beer by their manufacturer.
    SELECT name
    FROM Beers b1
    WHERE _____ _____ ( -- NOT EXISTS
        SELECT *
        FROM ______ -- Beers
        WHERE manf = ______ -- b1.manf 
        AND ______    ______ b1.name -- name <>
    );
    
  • Find the drinkers and beers such that: The drinker likes the beer, and the drinker frequents at least one bar that sells the beer.
    (SELECT * FROM Likes)
    ________ -- intersect
    (SELECT ______,  _______ -- drinker, beer
    FROM Sells, Frequents
    WHERE _______ = _______); -- Frequents.bar == Sells.bar
    
  • From Sells (bar, beer, price) and Beers (name,manf), find the average price of those beers that are either served in at least three bars or are manufactured by Pete’s.
    SELECT beer, ______ -- AVG(Sells.price)
    FROM Sells
    GROUP BY beer
    ______ COUNT (bar) >= _____  -- WHERE, 3
        ______ beer _______ (SELECT name -- OR, IN
                            FROM Beers
                            WHERE manf = ______); -- "Pete's"
    
  • Find the average price for each beer that is sold by more than one bar in New Brunswick:
    SELECT beer, ______ -- AVG(price)
    FROM Sells
    WHERE address = ‘New Brunswick’
    _____   _____ beer -- GROUP BY
    HAVING _____ > 1 -- COUNT(bar)
    
  • Drinkers who live together frequent same bar(0/1).
    SELECT IF (COUNT (*)>0, 1, 0)
    FROM (SELECT ______, d1.addr 
            FROM drinkers d1, drinkers d2
            WHERE d1.name_____     _____ -- <> d2.name
                AND ____=_____) d, frequents -- d1.addr, d2.addr
    WHERE ______=frequents.drinker
    

Questions 2

  • Find bars where either Tim or John drink.
    SELECT bars
    FROM frequents f1
    WHERE f1.drinker = 'Tim'
    UNION
    SELECT bars
    FROM frequents f2
    WHERE f2.drinker = 'John'
    
  • Which bar sells cheapest ‘Bud’ ?
    SELECT bar
    FROM Sells
    WHERE beer = "Bud" AND (
        SELECT MIN(price)
        FROM Sells
        WHERE beer = "Bud"
    )
    
  • For each bar that serve both ‘heineken’ and ‘bud’, count the number of drinkers that frequent that bar.
    SELECT bar, COUNT(DISTINCT f.drinker)
    FROM frequents f
    WHERE EXISTS (SELECT 
            FROM Sells s1, Sells s2
            WHERE s1.beer == "Heineken"
                AND s2.beer == "Bud"
                AND s1.bar == s2.bar
    )
    GROUP BY bar
    
  • Find the average price of common beers (where “common” means those that are served in more than two bars).
    SELECT beer, AVG(price)
    FROM Sells
    GROUP BY beer
    HAVING COUNT(Sells.bar) > 2
    
  • Bars having common drinkers must have common beers (false/true)
  • There is at least one beer which is sold at all bars. (true/false)

Questions 3

  • Beer that has the maximum average price over the bars.
    SELECT s.beer 
    FROM sells s 
    __________ __________ __________ -- GROUP BY BEER
    HAVING AVG(price) __________  __________ ( -- >= ALL
        SELECT beer, __________ (price) AS av_p -- AVG
        FROM sells 
        __________ __________ __________) -- GROUP BY BEER
    

October 10th, 2013 Lecture: Design Theory for Relational Databases

Functional Dependency

  • X -> Y is an assertion about a relation R that whenever two types of R agree on all the attributes of X, then they must also agree on all attributes in set Y.
    • Say “X -> Y holds in R.”
    • Convention: X, Y, and Z represent sets of attributes.

Splitting Right Sides of FDs

  • X -> A_1, A)2, ... A_n holds for R exactly when each of the expressions of for R.

Example

Drinkers(name, addr, beersLiked, manf, favBeer)
  • Reasonable FDs to assert:
    1. name -> addr favBeer
    2. beersLiked -> manf

October 21st, 2013 Quiz 4 Study

Relational Model

  • The relational model for database management is a database model based on first-order predicate logic.
    • all data is represented in terms of tuples
    • grouped into relations
  • A table in an SQL database schema corresponds to *a predicate variable*;
    • the contents of a table to a relation;
    • key constraints, other constraints, and SQL queries correspond to predicates

What is the core limitation of relational model?

Application to Databases

  • A data type as used in a typical relational database might be the set of integers, the set of character strings, the set of dates, or the two boolean values true and false, and so on.
  • Attribute is the term used in the theory for what is commonly referred to as a column.
    • Similarly, table is commonly used in place of the theoretical term relation.
  • A tuple is basically the same thing as a row

Keys

  • A primary key uniquely specifies a tuple within a table.
Method for finding keys
  1. If the letter is in the relation and not in the FDs, it must be a key.
  2. If the letter appears on the left but never on the right in the FDs, then it must be in the key.
  3. If the letter appears on the right but never on the left, then it cannot be in the key.
  4. If the letter is on both sides, you cannot not use this method.

Superkeys

  • A superkey is defined in the relational model of database organization as a set of attributes of a relation variable for which it holds that in all relations assigned to that variable, there are no two distinct tuples (rows) that have the same values for the attributes in this set.

Closure of a set of Attributes

Normal Forms

Why normal forms? Do they always work and when they do not work?

BCNF

  • A relational schema R is in Boyce–Codd normal form if and only if for every one of its dependencies X → Y, at least one of the following conditions hold:
    • X → Y is a trivial functional dependency (Y ⊆ X)
    • X is a superkey for schema R
Decomposition to BCNF

Third Normal Form

synthesis of 3NF from the minimal base of given family of functional dependencies

  • Codd’s definition states that a table is in 3NF *if and only if both of the following conditions hold*:
    • The relation R (table) is in second normal form (2NF)
    • Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R.

1NF

  • A relation is in first normal form if *the domain of each attribute contains only atomic values*, and *the value of each attribute contains only a single value from that domain*.

Functional Dependencies

  • A functional dependency is a constraint between two sets of attributes in a relation from a database.
  • Given a relation R, a set of attributes X in R is said to functionally determine another set of attributes Y, also in R, (written X → Y) if and only if each X value is associated with precisely one Y value; R is then said to satisfy the functional dependency X → Y.

October 26th, 2013 Reading, pages 350-359

Indexes in SQL

Index
An index on an attribute A of a relation is a data structure that makes it efficient to find those tuples that have a fixed value for attribute A.
We could think of the index as a binary search tree of (key, value) pairs, in which a key a (one of the values that attribute A may have) is associated with a “value” that is the set of locations of the tuples that have a in the component for attribute A.
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and the use of more storage space to maintain the extra copy of data. Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed.
Index key
Note that the key for the index can be any attribute or *set of attributes*, and need not be the key for the relation on which the index is built.
We shall refer to the attributes of the index as the index key when a distinction needs to be made.

Motivation for Indexes

  • When relations are very large, it becomes expensive to scan all the tuples of a relation to find those (perhaps very few) tuples
  • Without indexes, we have to look at every tuple of the two relations.

Declaring Indexes

  • most commercial systems have a way for the database designer to say that the system should create an index on a certain attribute for a certain relation.
    CREATE INDEX YearIndex ON Movies(year);
    

Selection of Indexes

  • Two important factors to consider are:
    • The existence of an index on an attribute may speed up greatly the execution of those queries in which a value, or range of values, is specified for that attribute, and may speed up joins involving that attribute as well.
    • On the other hand, every index built for one or more attributes of some relation makes insertions, deletions, and updates to that relation more complex and time-consuming.

Some Useful Indexes

  • Often, the most useful index we can put on a relation is an index on its key. There are two reasons:
    1. Queries in which a value for the key is specified are common. Thus, an index on the key will get used frequently.
    2. Since there is at most one typle with a given key value, the index returns either nothing or one location for a typle. Thus, at most one page must be retrieved to get that tuple into main memory.

Automatic Selection of Indexes to Create

  1. The first step is to establish the query workload. Since a DBMS normally logs all operations anyway, we may be able to examine the log and find a set of representative queries and database modifications for the database at hand. Or it is possible that we know, from the application programs that use the database, what the typical queries will be.
  2. The designer may be offered the opportunity to specify some constraints, e.g., indexes that must, or must not, be chosen.
  3. The tuning advisor generates a set of possible candidate indexes, and evaluates each one. Typical queries are given to the query optimizer of the DBMS. The query optimizer has the ability to estimate the running times of these queries under the assumption that one particular set of indexes is available.
  4. The index set resulting in the lowest cost for the given workload is suggested to the designer, or it is automatically created.

November 9th, 2013 Study Guide to Quiz

Data Access

I/Os

Assume: Tables Sells, Likes and Drinker. Sells has 100,000 records, Likes has 1 million records and Drinker has also 100,000 records. Assume that each block of data (page) holds 100 records.

  • Assume average seek time is 10ms. and average block transfer time is 5ms.
  • Take the largest table, Likes.
  • Time to read Likes:

    Transfer + Seek time = 10,000 blocks x 5ms = 50 seconds + 10ms

  • Time to find a specific record by sequentially scanning the file (does Joe like Sam Adams?) – on average half of all the blocks will have to be transferred and read, thus it is > 5,000 blocks x 5ms = 25 seconds + 10ms (seek time)
  • Now, lets assume that table Drinker has an index on the key attribute which is the Drinker’s name. Lets assume that index is organized as binary search tree and that drinkers table is ordered by drinker’s names. There is 1000 blocks and 1000 index entries. If there was miniscule amount of RAM – technically accessing each level in the binary search tree would generate a disk seek, for 1000 entries, around 12! But realistically, the whole index would easily fit to RAM even 20 years ago. So it would be 1 seek time and probably several blocks of data to transfer the index to RAM. Then perhaps 1 seek time + 1 transfer for the block which may contain the record in drinker’s file. Thus, realistically we would need the total of few transfer times and 2 x seek time = milliseconds!
  • In the class slides – the options A3 and A4 (and A5 and A6) all account for the height of the index (number of index levels, hi). But we need to be realistic. With current large RAM sizes I do not believe one would have to access index one block at a time, that is assuming that only one block of an index fits to the main memory. This time is long gone. I am assuming that there is enough RAM to hold all index there. Thus, to access index only one seek time is needed (assuming index is located contiguously) and several block transfers to move to index from the disk to the main memory.
  • So there are really only two cases worth comparing: the primary and the secondary indexes. If index is primary all records with that index value are consecutive on the disk. If index is secondary they are not. This is particularly important for attributes which are not key attributes. Consider two hypothetical attributes of Drinker table: Zip Code and Gender. Lets assume there is 1000 drinkers per zip code. And lets assume that there is roughly the same number of male and female drinkers.
  • Lets start with index on zip code. Assume that zip code index is primary. In that case data on the disk is arranged by zip codes. There is ten blocks of “drinkers” per zip code. Thus we will need one seek for index to bring it from the disk and transfer several blocks of index (negligible) plus another seek time for the first block of zipcode 08901 and transfer time for ten blocks of data. Total: > 2 x seek time = 20 ms > 20 block transfers = 100 ms > Total of 120 ms
  • Now assume that index on zip code is not primary (clustering). Drinkers in 08901 zip code may now be located on different blocks of the Drinker file. While Index transfer and seek time remains unchanged, we may now have to seek and transfer 1000 blocks from the hard disk since in the worst case each of the drinkers may be on different block. This time will dominate the overall performance since it will be > 1000 x (10ms +5ms) = 15 seconds
  • Now, just imagine what would happen if we repeated the same reasoning for much less selective index on “Gender”! All in all, the overall performance of a simple select from one table query is determined by whether there is an index, how selective is an attribute and if index is primary (clustering) or secondary. Selectivity of an index and proximity of records on the disk are the main factors in determination of overall performance. As I said in class, using an index is not always a great idea. For example, if a query asks for all female drinkers and the index on gender is non-primary, one may need to access 50,000 drinkers and transfer blocks of data 50,000 times (assuming no caching of previously accessed blocks). This would be: > 50,000 x 15ms = 750 seconds!
  • This is as opposed just to reading the file sequentially – roughly 1000 x 10ms = 10 seconds (1000 blocks of data transferred and one additional seek)

Joins

  • This is where the real performance bottlenecks begin. One can easily choke the db system on joins. Even joining two tables may prove to be very expensive. But joins of more than two tables (even self joins) will quickly make your system unresponsive. Just to illustrate this points lets use calculations from the slides.
  • Suppose we make natural join of tables “sells” with “likes” on the beer attribute. For outer table always chose a smaller of the two tables – arguments of the natural join. Thus, our outer table will be the Sells table and our inner table will be Likes table.
  • In the worst case, if there is enough memory only to hold one block of each relation (OK, this is really not so realistic in this case!) the estimated cost is

$$nr ∗ bs + br$$

  • Which is $100,000 \times 10,000 + 1000$ block transfers = roughly one billon block transfers! That would be $5 x 10^6$ seconds! What is it, a few months or something?
  • And $100,000 + 1000$ seeks = $101,000$ seeks $= 1000 seconds$, very brisk comparing with the data transfer…..
  • Yes, true, very unrealistic, tuple at a time and practically no RAM. Below is a more realistic calculations with block nested loop join
  • Worst case estimate: $b_r \times b_s + b_r$ block transfers $+ 2 \times br $seeks
  • Each block in the inner relation s is read once for each block in the outer relation (instead of once for each tuple in the outer relation
  • Best case: $b_r + b_s$ block transfers $+ 2$ seeks.
  • Now we have $1000 \times 10,000 + 1000 = 10$ million block transfers this is merely just $50,000$ seconds.\
  • Finally suppose we have an index on “beer” which is an attribute we join both tables on Now the cost is:
  • Cost of the join: $b_r (t_T + t_S) + n_r ∗ c$
  • Where c is the cost of traversing index and fetching all matching s tuples for one tuple or r
  • c can be estimated as cost of a single selection on s using the join condition
  • Thus, on our case, it is 1000 block transfers and seeks (since now we bring one block at a time) for sells and then for each record in sells we will use index on beer 100,000 times just to retrieve the matching tuples in Likes in order to compute the natural join. Lets first assume there is 1000 beers so each of them is liked by 100 drinkers on average. If the index on beer is a primary one that would be just 1 block of drinkers. Overall that would translate to roughly only 100,000 I/Os (transfer + seek time), 500 seconds overall. But if the index on beer was non-primary, that would be 100 blocks and 10 million I/Os…..not nearly as good.

Serial and Serializable Schedules

Scheduler
The timing of individual steps of different transactions needs to be regulated in some manner. This is the job of a scheduler component of a DBMS.

+-----------------+
| Transaction     |
| Manager         |
+-----------------+
    | Read/write
    | requests
    v
+-----------------+
| Scheduler       |
|                 |
+-----------------+
    | Reads and
    | writes
    V
+-----------------+
|||||||||||||||||||
|||||||||||||||||||
+-----------------+
Concurrency control
The general process of assuring that transactions preserve consistency when executing simultaneously.
Serializability
Assuring that concurrently executing transactions preserve correctness of the database state.
Conflict-serializability
A stronger conditions that most schedulers actually enforce.

(Correctness Principle): Every transaction, if executed in isolation (without any other transactions running concurrently), will transform any consistent state to another consistent state.

Schedules

Schedule
A sequence of the important actions taken by one or more transactions.

Serial Schedules

Serial
A schedule is serial if its actions consist of all the actions of one transaction, then all the actions of another transaction, and so on. No mixing of actions is allowed.

$T_1$ $T_2$ $A$ $B$
READ(A,t) $25$ $25$
t := t + 100
WRITE(A,t) $125$
READ(B,t)
t := t + 100
WRITE(B, t) $125$
READ(A, s)
s := s * 2
WRITE(A, s) $250$
READ(B, s)
s := s * 2
WRITE(B, s) $250$

For the transaction here, there are two serial schedules. There in $T_1$ which precedes $T_2$, and the initial state is $A = B = 25$. We shall take the convention that when displayed vertically, time process down the page. Also the values of $A$ and $B$ shown refer to their valleys in main-memory buffers, not necessarily to their values on disk.

The transactions are executed non-interleaved (see example above) i.e., a serial schedule is one in which no transaction starts until a running transaction has ended.

Serializable Schedules

The Correctness Principle for Transactions
Every serial schedule will preserve consistency of the database state.
Serializable
In general, we say that a schedule $S$ is serializable if there is a serial schedule $S\prime$ such that for every initial database state, the effects of $S$ and $S\prime$ are the same.

$T_1$ $T_2$ $A$ $B$
READ(A,t) $25$ $25$
t := t + 100
WRITE(A,t) $125$
READ(A,s)
s := s * 2
WRITE(A, s) $250$
READ(B, t)
t := t + 100
WRITE(B, t) $125$
READ(B, s)
s := s * 2
WRITE(B, s) $250$

This shows a schedule of the transactions of the first example that is seriablizable but not serial. In this schedule, $T_2$ acts on $A$ after $T_1$ does, but before $T_1$ acts on $B$. However, we see that the effect of the two transactions scheduled in this manner is the same as for the serial schedule.

To convince ourselves of the truth of this statement, we must consider not only the effect from the database state $A = B = 25$, but from any consistent database state.

A transaction schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of its transactions executed sequentially without overlapping in time.
A schedule that is equivalent (in its outcome) to a serial schedule has the serializability property.

The Effect of Transaction Semantics

Any database element $A$ that a transaction $T$ writes is given a value that depends on the database state in such a way that no arithmetic coincidences occur.

A Notation for Transactions and Schedules

  • If we assume “no conincidences,” the only the reads and writes performed by the transactions matter, not the actual values involved.
    • Thus, we shall represent transactions and schedules by a shorthand notation, in which the actions are $r_T (X)$ and $w_T (X)$,
      • Meaning that transactions $T$ reads, or respectively writes, database element $X$.
  • The first table can be written:

$$T_1 : r_1(A); w_1(A); r_1(B); w_1(B)$$ $$T_2 : r_2(A); w_2(A); r_2(B); w_2(B)$$

  • To make notation precise:
    1. A action is an expression of the form $R_i(X)$ or $w_i(X)$, meaning that transaction $T_i$ reads or writes, respectively, the database element $X$.
    2. A transaction $T_i$ is a sequence of actions with subscript $i$.
    3. A schedule $S$ is a set of transactions $T$ is a sequences of actions, in which each transaction $T_i$ in $T$, the actions of $T_i$ appear in $S$ in the same order that they appear in the definition of $T_i$ itself. We say that $S$ is an interleaving of the actions of the transactions of which it is composed.

Conflict-Serializability pg. 800

  • Schedulers in commercial systems generally enforce a condition, called “conflict-serializablity”, that is stronger than the general notion of serialiability introduced ealier.

Conflicts

Conflict
A pair of consecutive actions in a schedule that, if their order is interchanged, then the behavior of at least one of transactions can change.
Two actions are said to be in conflict (conflicting pair) if:

  1. The actions belong to different transactions.
  2. At least one of the actions is a write operation.
  3. The actions access the same object (read or write).
Most pairs of actions do not conflict.
In what follows, we assume that $T_i$ and $T_j$ are different transactions:

  1. $r_i(X); r_j(Y)$ is never a conflict, even if $X = Y$. The reason is that neither of these steps change the value of any database element.
  2. $r_i(X); w_j(Y)$ is not a conflict provided that $X \neq Y$. The reason is that should $T_j$ write $Y$ before $T_i$ reads $X$, the value of $X$ is not change.
  3. $w_i(X); r_j(Y)$ is not a confict if $X \neq Y$.
Three ways that always conflict:

  1. Two actions of the same transactions always conflict.
  2. Two writes of the same database element by different transactions conflict. That is, $w_i(X); w_j(X)$ is a conflict.
  3. A read and a write of the same database element by different transaction also close.
Conflict equivalence
The schedules S1 and S2 are said to be conflict-equivalent if following two conditions are satisfied:

  1. Both schedules S1 and S2 involve the same set of transactions (including ordering of actions within each transaction).
  2. The set of conflicting pairs in S1 is the same as in S2.
Conflict-serializable
A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or more serial schedules.

ACID

ACID
Atomicity, Consistency, Isolation, Durability
Atomicity
In an atomic transaction, a series of database operations either all occur, or nothing occurs.
Consistency
A consistent transaction is one that does not violate any integrity constraints during its execution. If a transaction leaves the database in an illegal state, it is aborted and an error is reported.
Isolation
Isolation is a property that defines how/when the changes made by one operation become visible to other concurrent operations.
Durability
The property which guarantees that transactions that have committed will survive permanently. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes.

Concurrency Control

Concurrency control
ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible.
Transaction
A transaction comprises a unit of work performed within a database management system (or similar system) against a database, and treated in a coherent and reliable way independent of other transactions. Transactions in a database environment have two main purposes: To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure, when execution stops (completely or partially) and many operations upon a database remain uncompleted, with unclear status. To provide isolation between programs accessing a database concurrently. If this isolation is not provided, the program’s outcome are possibly erroneous.

Indexes

Indexes
While not part of the SQL standard, commercial SQL systems allow the declaration of indexes on attributes; the indexes speed up certain queries that involve specification of value, or range of values, for the indexed attribute(s).
Choosing indexes
While indexes speed up queries, they slow down database modifications, since the indexes on the modified relation must also be modified. Thus, the choice of indexes is a complex problem, depending on the actual mix of queries and modifications perfumed on the database.
Automatic index selection
Some DBMS’s offer tools to choose indexes for a database automatically. They examine the typical queries and modifications performed on the database and evaluate the cost trade-off for different indexes that might be created.

November 5th Transactions and Concurrency Control

November 7th Transactions and Concurrency Control

November 12th Quiz on Concurrency Control & Indexing + NoSQL intro lecture

November 14th NoSQL class

November 19 Redemption Quiz

Study

How to use functional dependancies to determine keys

  • If an attribute only appears on the left hand side of a set of dependancies, it is part of every key.
  • If an attribute only appears on the right hand side of a set of dependancies, it is part of not key.
  • If it is in middle, it could be but might not be.
  • Consider:
    R(ABCD)
    F { AB -> C, C -> B, C -> D }
    
Left Middle Right
A BC D

A is not the key because it does not determine every attribute.

  • AB is a key, but we also have the try the other attribute.
  • Let’s look at AC+ closure
  • Consider:
    R(ABC)
    
    FD = { A -> B, B -> C, C -> A }
    
  • Everything is in the middle!
    • We have to try the attributes one at a time, then two at a time, then three at a time.
      A+ = ABC
      B+ = ABC
      C+ = ABC
      
P NP
ABCD NA

First and Second Normal Forms

  • What is a normal form?
    • Informally, it is a way of formatting a database.
    • The higher the form, the more constraints there are.
1NF
  • All attributes are atomic.
  • One value per cell.
  • Consider:
    Course Student
    Database Bob
    Joe
    Sue
    Math Tim
    Mary
  • To make this in INF:
    Course Student
    Database Bob
    Database Joe
    Database Sue
    Math Tim
    Math Mary
2NF

Let $R$ be a relation schema with FDs $F$. Let $X$ be a set of attributes of $R$ and $A$ is a non-prime attribute not in $X$. Then $R$ is in 2NF if whenever $X \to A \in F+$, then $X$ is not a proper subset of any key.

  • “How do we know if a relation table is in 2NF?” Consider:
    R(ABCDE)
    F = { ABD -> C, BC -> D, CD -> E }
    
    Left Middle Right
    AB CD E
    • AB+ = { }
    • ABC+ = { ABCDE }
    • ABD+ = { ABCDE }
    Prime Non-Prime
    ABCD E
    • By our definition of 2NF, the way you violate 2NF, you have a FD where the right-hand side is non-prime and not part of $X$ and left hand side is a proper subset of a key.
      • Our keys our ABC and ABD

Third Normal Form and Database Design Shortcuts

  • If $A$ is non-prime, $X$ has to be a super key.

Quiz 1

Quiz 2

Quiz 5

Question 1

Consider a relation $R = (ABCD)$, for each of the following sets of functional dependencies, do the following (4 questions for each set of FDs):

  1. Compute.1
  2. Specify all of the candidate keys.
  3. Identify whether R is in BCNF?
  4. If R is not in BCNF, decompose it into a collection of BCNF relations.
  1. $FD = \lbrace B \to C, D \to A \rbrace$
    1. Compute AB+
      • This is a closure, which means see what you “can get with” A and B.
      • You get AB by reflexivity.
      • You get C by the fact you have B.
      • AB+ = { A, B, C }
    2. Specify all of the candidate keys.
      • Consider:
        Left Middle Right
        B D A C
      • The only key is BD.
    3. Identify whether R is in BCNF?
      • You can check to see if every closure of a left hand side of a FD is equal to all attributes in the relation.
      • $B+ = BC$
        • Not a superkey.
      • $D+ = DA$
        • Not a superkey.
    4. If R is not in BCNF, decompose it into a collection of BCNF relations.i
      • For B to C, R(BC)
        • B+ = BC, {B -> C} (In BCNF)
        • -B+ = ABD, {A -> D} (Not in BCNF)
      • For D to A
        • D+ = AD, {D -> A} (In BCNF)
        • -D+ = DBC, {B -> C} (Not in BCNF)
          • For B to C, (BC) and (BD) {} (In BCNF)
      • {BC, AD, BD}
  2. $FD = \lbrace C \to D, C \to A, B \to C \rbrace$
    1. $AB+ = ABCD$
    2. For keys, consider:
      Left Middle Right
      B C AD
      • B is every key.
      • C could be in any key.
      • AD are in no keys.
      • B is the only key.
    3. It is not in BCNF because none of the left hand sides in the FDs are superkeys.
    4. Lets decompose it to BCNF relations:
      • $C \to D$ violates
        • LSH+ = ACD {C -> D, C -> A}, in BCNF
        • LSH- = BC, {B -> C}, in BCNF
      • The decompisition is {ACD, BC}
  3. $FD = \lbrace ABC \to D, D \to A \rbrace$
    1. $AB+ = AB$
    2. For keys, consider:
      Left Middle Right
      BC AD
      • It can be BCD or ABC.

Quiz 6

Question 1

For each of the following schedules, state whether they are serializable and whether they are conflict-serializable. For each schedule, draw the corresponding precedence graph. If the schedule is conflict-serializable, show all the conflict-equivalent serial schedules. If the schedule is not serializable, provide an example to show that no serial schedule has the same effect as given schedule.

  1. S = r1 (A); r2 (A); w1 (A); w2 (A)
    $T_1$ $T_2$
    r(A)
    r(A)
    w(A)
    w(A)
    1. For each transaction $T_i$ participating in schedule $S$, create a node labelled $T_i$ in the precedence graph.
    2. For each case in $S$ where $T_i$ executes a write_item(X) then $T_j$ executes a read_item(X), create an edge ($T_i \to T_j$) in the precedence graph.
    3. For each case in $S$ where $T_i$ executes a read_item(X) then $T_j$ executes a write_item(X), create an edge ($Ti \to Tj$) in the precedence graph. This will bring to front a directed graph from $T_1$ to $T_2$.
    4. For each case in $S$ where $T_i$ executes a write_item(X) then $T_j$ executes a write_item(X), create an edge ($T_i \to T_j$) in the precedence graph.
    5. The schedule $S$ is serializable if the precedence graph has no cycles.
    • In short, if there is a write then a read, a read then a write, or a write then a write, that is anything to do with a write, for any given two times, add an edge.
    • I’m seeing nothing wrong with the first two rows of the table above, but then there is read and then a write, and then a write then a write.
      • That makes a cycle. Not conflict serializable.
  2. $S = r_1 (A), r_2 (B), w_3 (A), r_2 (A), r_1 (B)$
    $T_1$ $T_2$ $T_3$
    r(A)
    r(B)
    w(a)
    r(A)
    r(B)

    Precedence graph for question 2

    • No cycles, no conflict serializable.i
    • I think, to work out equivilient schedules, move around the times and check the conflicts again.
    • Conflict-serializability implies serializability.

Question 2

Show all the conflict-serializable schedules of the following transactions T1 and T2. Explain why these are all such schedules.

T1: r1 (A) w1 (A) r1 (B) w1 (B)

T2: r2 (B) w2 (B) r2 (A) w2 (A).

What are the conflict equivilent schedules?

  • Conflict-equivalent to T1, T2:
    r1 (A) w1 (A) r1(B) w1(B) r2(B) w2(B) r2(A) w2(A) 
    
    • Reason: This is the serial schedule T1, T2.
    • There are no other conflict-equivalent schedules, since there is no non-conflicting swap of adjacent actions in this schedule: we cannot swap actions within a transaction, and w1 (B) r2 (B) cannot be swapped, because they conflict (both manipulating the same database element and one of them is a write).
  • Conflict-equivalent to T2, T1:
    r2(B) w2(B) r2(A) w2(A) r1(A) w1(A) r1(B) w1(B)
    

December 5th Class Projects Presentations

December 10th Project Finalists and Crowning of the Winner


  1. Compute what? It seems to be AB+