CS377: Database Design - Data Integrity

Activity Goals

The goals of this activity are:
  1. To differentiate between domain and integrity constraints

The Activity

Directions

Consider the activity models and answer the questions provided. Reflect briefly on each question on your own, then discuss and compare your thoughts with your group. Appoint one member of your group to present your findings to the class, and have the rest of the group help that member prepare their response. Report out on areas of disagreement, items for which your group identified alternative approaches, and any questions you encountered along the way for whole-class discussion. After class, respond to the reflective prompt questions individually in your notebook.

Model 1: Domain and Integrity Constraints

Questions

  1. The domain of the SMOKER attribute is all positive integers, but it is really intended to be 0 or 1. This is an integrity constraint since it cannot be restricted syntactically by the domain (although some SQL engines do support boolean data types!). What other integrity constraints do you see in this table?
  2. Describe a potential referential integrity violation that could occur within this database if records can be deleted from the EMPLOYEE table without checking the SPOUSE table first.
  3. What should be done in the EMPLOYEE table if an entry is deleted in the SPOUSE table?
  4. Draw this schema, and indicate the cardinality of each relationship (1:1, 1:many, optional).
  5. What inefficiency exists within this table? Could someone be a spouse of one person and a child of another? Re-design this schema to eliminate redundant data storage.
  6. What do you think a CHECK Constraint does?
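As a starting point for the last question, here is a minimal sketch of a CHECK constraint in action, using SQLite through Python's built-in sqlite3 module. The column names echo the activity model, but the exact schema here is an assumption for illustration only:

```python
import sqlite3

# Create an in-memory database with a CHECK constraint restricting
# SMOKER to 0 or 1 (hypothetical schema for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        ID INTEGER PRIMARY KEY,
        NAME TEXT NOT NULL,
        SMOKER INTEGER CHECK (SMOKER IN (0, 1))
    )
""")

conn.execute("INSERT INTO EMPLOYEE VALUES (1, 'Alice', 0)")  # Accepted

try:
    conn.execute("INSERT INTO EMPLOYEE VALUES (2, 'Bob', 7)")  # Violates the CHECK
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```

The engine enforces the constraint on every insert and update, so the restriction to 0 or 1 holds no matter which application writes to the table.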

Data Integrity

Data integrity refers to the accuracy, completeness, consistency, and reliability of data throughout its lifecycle. It ensures that data remains unchanged, complete, and valid from creation to deletion. Maintaining data integrity is crucial as it supports data quality, enhances decision-making, and promotes trust in the data.

Data integrity is essential in various domains, including finance, healthcare, e-commerce, and scientific research. Here are some reasons why data integrity is important:

  1. Trustworthiness: Ensuring data integrity builds trust among users who rely on the accuracy and consistency of data for decision-making.
  2. Compliance: Many industries have regulatory requirements for data integrity, such as HIPAA in healthcare or Sarbanes-Oxley Act in finance.
  3. Efficiency: Accurate and reliable data streamlines business processes, reduces errors, and improves efficiency.
  4. Data Analysis: High data integrity enables accurate and meaningful analysis, leading to more informed decisions and insights.

Maintaining Data Integrity

To maintain data integrity, several techniques and practices can be employed:

  1. Data Validation: Validate data during entry to ensure it meets predefined criteria. This can involve techniques such as data type checks, format validations, and range validations.
  2. Access Controls: Implement access controls and user permissions to prevent unauthorized modifications to data.
  3. Backup and Recovery: Regularly back up data and establish recovery procedures to restore data in case of accidental or intentional corruption.
  4. Version Control: Use version control systems to track changes made to data and provide mechanisms for reverting to previous versions if necessary.
  5. Hash Functions: Utilize cryptographic hash functions to generate a unique hash value for data. Comparisons of hashes can be used to verify data integrity.
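The first technique above, data validation, can be sketched in plain Python with the three kinds of checks mentioned. The field names and rules below are assumptions for illustration:

```python
import re

def validate_record(record):
    """Return a list of validation errors (empty if the record is valid).

    The fields and rules are hypothetical, chosen to illustrate type,
    range, and format checks.
    """
    errors = []
    if not isinstance(record.get("age"), int):
        errors.append("age must be an integer")            # type check
    elif not (0 <= record["age"] <= 120):
        errors.append("age must be between 0 and 120")     # range check
    email = record.get("email", "")
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
        errors.append("email must look like name@domain")  # format check
    return errors

print(validate_record({"age": 25, "email": "john@example.com"}))  # []
print(validate_record({"age": 300, "email": "not-an-email"}))     # two errors
```

Rejecting bad values at entry time is far cheaper than repairing them later, which is why validation usually sits at the boundary of the system.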

Elementary Data Integrity

Profiling and Summarizing Data in Python Using Pandas

We can search for missing values per the following example:

import pandas as pd

# Load dataset
df = pd.read_csv("dataset.csv")

# Data profiling
print("Data Shape:", df.shape)
print("Data Summary:")
print(df.describe())
print("Missing Values:")
print(df.isnull().sum())
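Once profiling has revealed missing values, a common next step is to impute or drop them. The small in-memory DataFrame, column names, and fill strategy below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({"age": [25, None, 42], "city": ["Boston", "Austin", None]})

print(df.isnull().sum())  # one missing value per column

df["age"] = df["age"].fillna(df["age"].median())  # Impute a numeric column
df = df.dropna(subset=["city"])                   # Drop rows missing a key field

print(df.isnull().sum())  # no missing values remain
```

Whether to impute or drop depends on the column: imputation preserves rows but invents values, while dropping loses data but keeps every remaining value genuine.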

Data Cleansing in Python Using Pandas

We can remove duplicate rows from a pandas DataFrame as follows:

import pandas as pd

# Load dataset
df = pd.read_csv("dataset.csv")

# Remove duplicate entries
df.drop_duplicates(inplace=True)

# Save cleansed dataset
df.to_csv("cleansed_dataset.csv", index=False)
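By default, drop_duplicates only removes rows that are identical in every column. When duplicates should be judged by a key field instead, the subset and keep parameters control which rows survive; the column names below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: two rows share an email but differ in name
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "name":  ["Alice",   "Alice A.", "Bob"],
})

# Treat rows with the same email as duplicates, keeping the last one seen
deduped = df.drop_duplicates(subset=["email"], keep="last")
print(deduped)
```

Choosing which duplicate to keep (first, last, or none) is itself an integrity decision, since the discarded rows may carry the more accurate values.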

Ensuring Data Integrity with Python

Python offers various libraries and techniques to maintain data integrity. Here are a few examples:

Data Validation with pydantic

pydantic is a library that provides runtime data validation and parsing for Python. By default it coerces compatible values (for example, the string "25" becomes the int 25) and raises a ValidationError when coercion fails.

Example:

from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

user_data = {"name": "John", "age": "25"}
user = User(**user_data)  # Accepted: pydantic coerces the string "25" to the int 25

bad_data = {"name": "John", "age": "twenty-five"}
User(**bad_data)  # Raises a ValidationError: the age cannot be coerced to an int

Hash Functions with the hashlib Module

The hashlib module in Python provides access to various secure hash and message digest algorithms.

Example:

import hashlib

def calculate_hash(data):
    hasher = hashlib.sha256()
    hasher.update(data.encode())
    return hasher.hexdigest()

original_data = "Hello, World!"
hash_value = calculate_hash(original_data)  # Calculate hash value
print(hash_value)

modified_data = "Hello, World!!"
modified_hash_value = calculate_hash(modified_data)  # Calculate hash value
print(modified_hash_value)

print(hash_value == modified_hash_value)  # False, indicating data modification
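The same idea extends to files on disk: record a checksum while the data is known to be good, then recompute and compare it later to detect corruption. The filename and chunk size below are assumptions for illustration:

```python
import hashlib

def file_sha256(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Record a checksum while the file is known to be good ...
with open("data.txt", "w") as f:
    f.write("Hello, World!")
expected = file_sha256("data.txt")

# ... then later, recompute and compare to detect corruption
print(file_sha256("data.txt") == expected)  # True while the file is unchanged
```

Reading in chunks keeps memory use constant, so the same function works for files far larger than RAM.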

Submission

I encourage you to submit your answers to the questions (and ask your own questions!) using the Class Activity Questions discussion board. You may also respond to questions or comments made by others, or ask follow-up questions there. Answer any reflective prompt questions in the Reflective Journal section of your OneNote Classroom personal section. You can find the link to the class notebook on the syllabus.