CS377: Database Design - Data Modeling and Normalization
Activity Goals
The goals of this activity are:- To model database dependencies using a formal notation
- To normalize database schemas at varying degrees (first, second, and third normal forms)
The Activity
Directions
Consider the activity models and answer the questions provided. First reflect on these questions on your own briefly, before discussing and comparing your thoughts with your group. Appoint one member of your group to discuss your findings with the class, and the rest of the group should help that member prepare their response. Answer each question individually from the activity, and compare with your group to prepare for our whole-class discussion. After class, think about the questions in the reflective prompt and respond to those individually in your notebook. Report out on areas of disagreement or items for which you and your group identified alternative approaches. Write down and report out questions you encountered along the way for group discussion.Model 1: Determinant / Dependent Relationships
\(SSN, PhoneNumber \longrightarrow Name, PhoneNumber\)
\(SSN \longrightarrow Name\), \(Name \longrightarrow Address\)
Questions
- What is the determinant attribute and the dependent attribute above? In other words, if you know the determinant, you can look up the dependent value?
- What determinant / dependent relationships can you find from the data schema in the ER Modeling Activity?
- Is a primary key a determinant or a dependent? How about records with a foreign key?
- Describe, in your own words, the axiom of augmentation from the second rule above.
- The
Phone Number
field should not be part of the primary key; how can we break up this relationship to create a normalized schema? - Describe the axiom of transitivity, showing that
Address
ultimately depends uponSSN
from the third relationship above.
Model 2: Normalization and Normal Forms
There are many normal forms, and you have been following unnormalized form already because you incorporate a primary key in your tables! We will explore first, second, and third normal forms.
Questions
- The database contains multiple values within one column (
Subject
). Normalize to first normal form (1NF) by re-designing this schema to make Subject a dependency in another table, and thus make theSubject
value atomic (singular valued). - What is the composite primary key of this table? To establish second normal form (2NF), establish a single column primary key. Move the composite key columns into their own table that you can link via a foreign key.
- To establish third normal form (3NF), identify any transitive dependencies, and create separate tables for each group of related columns. This way, no non-key field values depend on one another within the same table.
- Sixth normal form (6NF) states that each row contains a primary key and just one additional column! What is the major drawback that precludes 6NF in practice? What benefit might be obtained if 6NF were achieved?
- Suppose you have a table with a person's entire postal address in a single column. How would you normalize this to 1NF?
Overview of Normalization
Normalization is a process of organizing data in a database to eliminate redundancy and dependencies. It ensures that each attribute in a table depends only on the table’s primary key and not on any other non-key attributes. Normalization is divided into several normal forms, each addressing a specific type or level of data redundancy.
The following are the commonly used normalization forms:
-
First Normal Form (1NF): In this form, the data is organized into tables where each column has a single value, and each row is unique. There should be no repeating groups or arrays of data within a table.
-
Second Normal Form (2NF): In addition to the requirements of 1NF, this form eliminates partial dependencies. A partial dependency occurs when a non-key attribute depends on only a part of the primary key.
-
Third Normal Form (3NF): In addition to the requirements of 2NF, this form eliminates transitive dependencies. A transitive dependency occurs when a non-key attribute depends on another non-key attribute, which itself depends on the primary key.
Normalization helps in improving database performance, reducing data duplication, and maintaining data consistency. By eliminating redundancy, it minimizes the chances of data inconsistency and update anomalies.
ER Modeling in Python Example
We can define entities, relationships, and attributes using Python classes and demonstrate normalization techniques.
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Author(Base):
__tablename__ = 'authors'
id = Column(Integer, primary_key=True)
name = Column(String)
class Book(Base):
__tablename__ = 'books'
id = Column(Integer, primary_key=True)
title = Column(String)
author_id = Column(Integer, ForeignKey('authors.id'))
author = relationship('Author', backref='books')
class Order(Base):
__tablename__ = 'orders'
id = Column(Integer, primary_key=True)
book_id = Column(Integer, ForeignKey('books.id'))
book = relationship('Book')
class Customer(Base):
__tablename__ = 'customers'
id = Column(Integer, primary_key=True)
name = Column(String)
order_id = Column(Integer, ForeignKey('orders.id'))
order = relationship('Order')
# Create tables and relationships
Base.metadata.create_all()
In the above example, we defined the entities Author
, Book
, Order
, and Customer
using Python classes. We also defined relationships between entities using foreign keys and established associations using SQLAlchemy.
Database Normalization
Database normalization is a process used to organize a database into tables and columns. The main idea with this is that a table should be about a specific topic and only supporting topics included. The purpose of normalization is to:
- Eliminate redundant data
- Ensure data dependencies make sense
There are three common forms of database normalization: 1st, 2nd, and 3rd normal form. They are also abbreviated as 1NF, 2NF, and 3NF respectively. There are additional levels of normalization as well, which apply stricter rules to establish “normality.”
The tables below can be expressed in ER modeling tools such as SQLAlchemy; however, we will visualize them at a high level to explore how tables, relationships, and keys should be organized at each normalization level.
First Normal Form (1NF)
For a table to be in the First Normal Form, it should follow the following 4 rules:
- It should only have single (atomic) valued attributes/columns.
- Values stored in a column should be of the same domain.
- All the columns in a table should have unique names.
- And the order in which data is stored, does not matter.
Example
Original Table (Not Normalized):
Order_ID | Product_ID | Product_Name | Customer_ID | Customer_Name | Country | Country_Code |
---|---|---|---|---|---|---|
1 | 101 | Product1 | 201 | Alice | USA | +1 |
2 | 102 | Product2 | 202 | Bob | UK | +44 |
3 | 103 | Product3 | 201 | Alice | USA | +1 |
After applying 1NF:
Orders Table
Order_ID | Product_ID | Customer_ID |
---|---|---|
1 | 101 | 201 |
2 | 102 | 202 |
3 | 103 | 201 |
Products Table
Product_ID | Product_Name |
---|---|
101 | Product1 |
102 | Product2 |
103 | Product3 |
Customers Table
Customer_ID | Customer_Name | Country |
---|---|---|
201 | Alice | USA |
202 | Bob | UK |
Second Normal Form (2NF)
For a table to be in the Second Normal Form,
- It should be in the First Normal form.
- And, it should not have Partial Dependency.
A partial dependency occurs when a non-prime attribute (an attribute that doesn’t belong to any candidate key) is dependent on a part of a candidate key.
Example
After applying 2NF:
Orders Table
Order_ID | Product_ID | Customer_ID |
---|---|---|
1 | 101 | 201 |
2 | 102 | 202 |
3 | 103 | 201 |
Products Table
Product_ID | Product_Name |
---|---|
101 | Product1 |
102 | Product2 |
103 | Product3 |
Customers Table
Customer_ID | Customer_Name | Country |
---|---|---|
201 | Alice | USA |
202 | Bob | UK |
Third Normal Form (3NF)
A table is said to be in the Third Normal Form when,
- It is in the Second Normal form.
- And, it does not have Transitive Dependency.
A transitive dependency occurs when a non-prime attribute is dependent on another non-prime attribute.
Example
After applying 3NF:
Orders Table
Order_ID | Product_ID | Customer_ID |
---|---|---|
1 | 101 | 201 |
2 | 102 | 202 |
3 | 103 | 201 |
Products Table
Product_ID | Product_Name |
---|---|
101 | Product1 |
102 | Product2 |
103 | Product3 |
Customers Table
Customer_ID | Customer_Name | Country |
---|---|---|
201 | Alice | USA |
202 | Bob | UK |
Countries Table
Country | Country_Code |
---|---|
USA | +1 |
UK | +44 |
Python example that creates the 3NF tables:
import sqlite3
# Connect to the SQLite database
conn = sqlite3.connect('orders.db')
# Create a cursor object
c = conn.cursor()
# Create table Orders
c.execute('''
CREATE TABLE Orders(
Order_ID INT PRIMARY KEY NOT NULL,
Product_ID INT NOT NULL,
Customer_ID INT NOT NULL);
''')
# Create table Products
c.execute('''
CREATE TABLE Products(
Product_ID INT PRIMARY KEY NOT NULL,
Product_Name TEXT NOT NULL);
''')
# Create table Customers
c.execute('''
CREATE TABLE Customers(
Customer_ID INT PRIMARY KEY NOT NULL,
Customer_Name TEXT NOT NULL,
Country TEXT NOT NULL);
''')
# Create table Countries
c.execute('''
CREATE TABLE Countries(
Country TEXT PRIMARY KEY NOT NULL,
Country_Code TEXT NOT NULL);
''')
# Insert data into Orders table
orders = [(1, 101, 201), (2, 102, 202), (3, 103, 201)]
c.executemany('INSERT INTO Orders VALUES (?,?,?)', orders)
# Insert data into Products table
products = [(101, 'Product1'), (102, 'Product2'), (103, 'Product3')]
c.executemany('INSERT INTO Products VALUES (?,?)', products)
# Insert data into Customers table
customers = [(201, 'Alice', 'USA'), (202, 'Bob', 'UK')]
c.executemany('INSERT INTO Customers VALUES (?,?,?)', customers)
# Insert data into Countries table
countries = [('USA', '+1'), ('UK', '+44')]
c.executemany('INSERT INTO Countries VALUES (?,?)', countries)
# Commit the transaction
conn.commit()
# Close the connection
conn.close()