Sunday, September 9, 2012

DBMS - Normalization & Normal Forms

Normalization is the process of organizing the attributes of database to reduce or eliminate data redundancy (having same data but at different places).

Problems because of data redundancy
Data redundancy unnecessarily increases size of database as same data is repeated on many places. Inconsistency problems also arise during insert, delete and update operations.
Functional Dependency
Functional Dependency is a constraint between two sets of attributes in a relation from a database. Functional dependency is denoted by arrow (→). If an attributed A functionally determines B, then it is written as A → B.
For example employee_id → name means employee_id functionally determines name of employee. As another example in a time table database, {student_id, time} → {lecture_room}, student ID and time determine the lecture room where student should be.
What does functionally dependent mean?
A function dependency A → B mean for all instances of a particular value of A, there is same value of B.
For example in the below table A → B is true, but B → A is not true as there are different values of A for B = 3.
A   B
------
1   3
2   3
4   0
1   3
4   0

Trivial Functional Dependency
X→Y is trivial only when Y is subset of X.
Examples
ABC --> AB
ABC --> A
ABC --> ABC
Non Trivial Functional Dependencies
X→Y is a non trivial functional dependencies when Y is not a subset of X.
X→Y is called completely non-trivial when X intersect Y is NULL.
Examples:
Id --> Name, 
Name --> DOB

Normal forms
Normal forms are used to eliminate or reduce redundancy in database tables.
First Normal Form
A relation is in first normal form if every attribute in that relation is singled valued attribute.
Example :
 
ID   Name   Courses
------------------
1    A      c1, c2
2    E      c3
3    M      C2, c3

In the above table Course is a multi valued attribute so it is not in 1NF.

Below Table is in 1NF as there is no multi valued attribute
ID   Name   Course
------------------
1    A       c1
1    A       c2
2    E       c3
3    M       c1
3    M       c2

Second Normal Form
A relation is in 2NF if it has No Partial Dependency, i.e.no non-prime attribute (attributes which are not part of any candidate key) is dependent on any proper subset of any candidate key of the table. 
For example consider following functional dependencies in relation  R (A,  B , C,  D )
AB -> C  [A and B together determine C]
BC -> D  [B and C together determine D]
In the above relation, AB is the only candidate key and there is no partial dependency, i.e., any proper subset of AB doesn’t determine any non-prime attribute.

Third Normal Form
A relation is in 3NF if at least one of the following condition holds in every non-trivial function dependency X→Y
a) x is a super key.
b) Y is a prime attribute (each element of Y is part of some candidate key).
For example consider relation R(A, B, C, D, E)
  A -> BC,
  CD -> E, 
  B -> D, 
  E -> A
All possible candidate keys in above relation are {A, E, CD, BC}
All attribute are on right sides of all functional dependencies are prime.

BCNF 
A relation is in BCNF if in every non-trivial functional dependency XY, X is a super key.
For example consider relation R(A, B, C) 
  A -> BC, 
  B -> A
A and B both are super keys so above relation is in BCNF.
Key Points
  1. BCNF is free from redundancy.
  2. If a relation is in BCNF, then 3NF is also also satisfied.
  3.  If all attributes of relation are prime attribute, then the relation is always in 3NF.
  4. A relation in a Relational Database is always and at least in 1NF form.
  5. Every Binary Relation ( a Relation with only 2 attributes ) is always in BCNF.
  6. If a Relation has only singleton candidate keys( i.e. every candidate key consists of only 1 attribute), then the Relation is always in 2NF( because no Partial functional dependency possible).
  7. Sometimes going for BCNF form may not preserve functional dependency. In that case go for BCNF only if the lost FD(s) is not required, else normalize till 3NF only.
  8. There are many more Normal forms that exist after BCNF, like 4NF and more. But in real world database systems it’s generally not required to go beyond BCNF.
Steps to find the highest normal form of a relation:
  1. Find all possible candidate keys of the relation.
  2. Divide all attributes into two categories: prime attributes and non-prime attributes.
  3. Check for 1st normal form then 2nd and so on. If it fails to satisfy nth normal form condition, highest normal form will be n-1.
Example 1. Find the highest normal form of a relation R(A,B,C,D,E) with FD set {A→D, B→A, BC→D, AC→BE}
Step 1.   As we can see, (AC)+ ={A,C,B,E,D}  but none of its subset can determine all attribute of relation, So AC will be candidate key. A can be derived from B, so we can replace A in AC by B. So BC will also be a candidate key. So there will be two candidate keys {AC, BC}.
Step 2.  Prime attribute are those attribute which are part of candidate key {A,B,C} in this example and others will be non-prime {D,E} in this example.
Step 3.  The relation R is in 1st normal form as a relational DBMS does not allow multi-valued or composite attribute.
The relation is not in 2nd Normal form because AD is partial dependency (A which is subset of candidate key AC is determining non-prime attribute D) and 2nd normal form does not allow partial dependency.
So the highest normal form will be 1st Normal Form.

Example 2. Find the highest normal form of a relation R(A,B,C,D,E) with FD set as {BCD, ACBE, BE}
Step 1.   As we can see, (AC)+ ={A,C,B,E,D}  but none of its subset can determine all attribute of relation, So AC will be candidate key. A or C can’t be derived from any other attribute of the relation, so there will be only 1 candidate key {AC}.
Step 2.  Prime attribute are those attribute which are part of candidate key {A,C} in this example and others will be non-prime {B,D,E} in this example.
Step 3.  The relation R is in 1st normal form as a relational DBMS does not allow multi-valued or composite attribute.
The relation is in 2nd normal form because BCD is in 2nd normal form (BC is not proper subset of candidate key AC) and ACBE is in 2nd normal form (AC is candidate key) and BE is in 2nd normal form (B is not a proper subset of candidate key AC).
The relation is not in 3rd normal form because in BCD (neither BC is a super key nor D is a prime attribute) and in BE (neither B is a super key nor E is a prime attribute) but to satisfy 3rd normal for, either LHS of an FD should be super key or RHS should be prime attribute.
So the highest normal form of relation will be 2nd Normal form.

Example 3. Find the highest normal form of a relation R(A,B,C,D,E) with FD set {BA, AC, BCD, ACBE}
Step 1.   As we can see, (B)+={B,A,C,D,E}, so B will be candidate key. B can be derived from AC using AC→B (Decomposing ACBE to ACB and ACE). So AC will be super key but (C)+={C} and (A)+={A,C,B,E,D}. So A (subset of AC) will be candidate key. So there will be two candidate keys {A,B}.
Step 2.  Prime attribute are those attribute which are part of candidate key {A,B} in this example and others will be non-prime {C,D,E} in this example.
Step 3.  The relation R is in 1st normal form as a relational DBMS does not allow multi-valued or composite attribute.
The relation is in 2nd normal form because BA is in 2nd normal form (B is a super key) and AC is in 2ndnormal form (A is super key) and BCD is in 2nd normal form (BC is a super key) and ACBE is in 2nd normal form (AC is a super key).
The relation is in 3rd normal form because LHS of all FD’s are super keys. The relation is in BCNF as all LHS of all FD’s are super keys. So the highest normal form is BCNF.