CS377: Database Design - Aggregation in Relational Databases
Activity Goals
The goals of this activity are:- To explain the use of primary and foreign keys in database systems
- To aggregate records across multiple tables with referential keys
Supplemental Reading
Feel free to visit these resources for supplemental background reading material.The Activity
Directions
Consider the activity models and answer the questions provided. First reflect on these questions on your own briefly, before discussing and comparing your thoughts with your group. Appoint one member of your group to discuss your findings with the class, and the rest of the group should help that member prepare their response. Answer each question individually from the activity, and compare with your group to prepare for our whole-class discussion. After class, think about the questions in the reflective prompt and respond to those individually in your notebook. Report out on areas of disagreement or items for which you and your group identified alternative approaches. Write down and report out questions you encountered along the way for group discussion.Model 1: Tables
Questions
- How might you compare database tables and records, with software classes and objects?
- How can you differentiate between the two students named Lee?
- Not all cultures use "first" and "last" names; what might be a better choice for these column names?
Model 2: Primary and Foreign Keys
Questions
- What is unfortunate about the structure of these tables? What could we do to improve upon it?
- Modify the tables above to add a "serial number" or a primary key to each of the tables.
- Can you think of a way to identify courses without having to number them? What might a more "natural key" be?
- What disadvantage could arise from using the data itself (like "CS377") as a key, particularly if that data changes over time?
- Eliminate the redundant data in these tables by replacing any redundancies with a new column and row specifying the appropriate key value that can be looked up in the corresponding table. This is called a foreign key.
Model 3: Aggregation
Questions
- Using your new table structure, describe an algorithm to determine the average age of students enrolled in CS377.
- What other types of aggregation functions might be useful across these tables?
- Extend the course database to show enrollments in a particular semester; once the class agrees on a design, draw out the tables and data including the key values.
Processing and aggregating data records is a fundamental aspect of data analysis and management. In this report, we will explore the concept of processing and aggregating data records, focusing on how it is done in databases and using the SQL language. We will discuss various techniques and algorithms used for efficient processing and aggregation and provide Python examples to illustrate the concepts.
Database Introduction
SQL (Structured Query Language) is a domain-specific language designed for managing data in a relational database management system (RDBMS). SQL provides various operations for processing and aggregating data records, including filtering, joining, sorting, and grouping.
SQL Aggregating Functions
SQLite provides a range of aggregating functions, such as COUNT, SUM, AVG, MIN, and MAX. These functions can be applied to columns or expressions, and they enable us to perform calculations on groups of rows or the entire result set.
For example, to calculate the total number of records in a table, we can use the COUNT function as follows:
SELECT COUNT(*) FROM table_name;
To compute the average value of a column, we can use the AVG function:
SELECT AVG(column_name) FROM table_name;
Python Examples
Here are some Python examples demonstrating how to process and aggregate data records using pandas, a popular data manipulation library:
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('data.csv')
# Filtering data
filtered_data = data[data['age'] > 30]
# Aggregating data
aggregated_data = data.groupby('category').agg({'value': 'sum', 'count': 'mean'})
In the first example, we filter the data records to only include those where the age is greater than 30. In the second example, we aggregate the data by category, calculating the sum of the ‘value’ column and the mean of the ‘count’ column.
Conclusion
Processing and aggregating data records is a crucial task in data analysis and management. SQL provides powerful tools for performing these operations within databases, while Python libraries like pandas offer similar functionality for working with structured data outside of databases. By efficiently processing and aggregating data, we can gain valuable insights and make informed decisions based on the information stored in our datasets.