Microsoft Access and SQL Server Database Normalization Tips
by
Luke Chung
President of FMS, Inc.
The information in this article applies to:
- Microsoft Access (all versions)
- SQL Server
Novice: Requires knowledge of creating tables.
Introduction
One of the most important steps in designing a database is ensuring
that the data is properly distributed among its tables. With proper data
structures, the remainder of the application (the queries, forms,
reports, code, etc.) is significantly simplified. Making sure this
foundation is properly designed is critical to the creation of an
application and the ability to scale and enhance it over time.
The formal name for proper table design is database normalization.
This article is an overview of the basic database normalization concepts
and some common pitfalls to consider and avoid.
Understanding Your Data
Before proceeding with table design, it's important to understand
what you're planning to do with your data and how it will change
over time. The assumptions you make impact the eventual design.
What Data Do You Need?
When designing an application, it's critical to understand the final
results to ensure that you have all the necessary data and know
where it comes from. For instance, what is the appearance of the
reports, where does each piece of data come from, and do they all
exist? Nothing is more damaging to a project than the realization,
late in the process, that data is missing for an important report.
Once you know what data you need, you must determine where it comes
from. Is the data imported from another source? Does that data
need to be cleaned or verified? Does the user enter data?
Having a firm grasp of what data is required and where it comes from
is the first step in database design.
What Are You Going to Do with the Data?
Will your users need to edit the data and, if so, how should the
data be displayed for them to understand and edit? Are there
validation rules and related lookup tables? Are there auditing
issues associated with data entry that require keeping backups of
edits and deletes? What kind of summary information needs to be
displayed to the user? Do you need to generate export files? With
this information, you can envision how the fields are related to
each other.
How Is Your Data Related to Each Other?
Group your data into related fields (e.g. customer related
information, invoice related information, etc.). Each group of
fields represents future tables. You should then consider how they
are related to each other. For instance, what tables are related in
a one-to-many relationship (e.g. one customer may have multiple
invoices) and what's a one-to-one relationship (often a
consideration to combine into one table).
What Is Going to Happen
to the Data Over Time?
After the tables are designed, the impact of time is often not
considered and causes huge problems later. Many table designs work
perfectly fine for immediate use. However, many designs breakdown as
users modify the data, new data gets added, and time passes. Often,
developers find they need to restructure their tables to accommodate
these changes. When table structures change, all their dependencies
(queries, forms, reports, code, etc.) also need to be updated. By
understanding and anticipating change over time, a better design can
be implemented to minimize the problems.
Learn How to Use Queries
Understanding how you are going to analyze and manipulate the data
is also important. You should have a firm grasp of how queries work,
how to use them to link data across multiple tables, how to use them
to group and summarize data, and the use of crosstab queries when
you need to display data in non-normalized format.
Ultimately, the goal of good data design is to balance the needs of
storing the data efficiently over time versus easily retrieving and
analyzing it. Understanding the power of queries significantly helps
with properly designing your tables.
Database Normalization Concepts
Rather than a theoretical discussion about database normalization,
here are some basic concepts for database normalization. How you
apply them in your situation may differ based on the needs of your
application. Hopefully, you'll understand these basic concepts,
apply them when you can, and understand the issues when you need to
deviate from them.
Store Unique Information in One Place
Most database developers understand this basic concept of data
normalization. Ideally, you'd like to store the same data in one
place and refer to it with an ID when you need to reference it.
Therefore, if some information changes, you can change it in one
place and the information "ripples" through your application.
For instance, a customer table would store a record for each customer
with their name, address, phone numbers, email, and other
characteristics. The customer table would have a unique Customer ID
field (usually an autonumber field) that is its key field and used by
other tables to refer to the customer. Therefore, an invoice table,
rather than storing all the customer information with each invoice
(because the same customer may have multiple invoices), would simply
refer to the Customer ID value which could be used to lookup the
customer details in the Customer table. Access makes it very easy to do
this through its powerful forms using combo boxes and subforms. If you
need to make a change to the customer's information (say their phone
number changed), you can change it in the customer table and know that
any other part of your application which references that information is
automatically updated.
With a properly normalized database, changes to data over time are
easily handled with a simple edit. Improperly normalized databases often
include programming or queries to make changes across multiple records
or tables. This not only requires more work to implement, it opens the
possibility that the data becomes inconsistent if the code or queries
don't execute or if there's a bug.
Records are Free, New Fields are Expensive
Databases should be designed so that over time, you simply add new
records. Database tables are designed to hold huge numbers of records.
However, if you find you need to add more fields, you probably have a
design problem.
This is often happens with spreadsheet experts who design databases
the way they are accustomed to designing spreadsheets. Time sensitive
fields are designed (e.g. Year, Quarter, Product, Salesman, etc.) which
require new fields to be added in the future. But the correct design is
to transpose the information and have the time sensitive data in one
field so more records can be added. For instance, rather than a separate
field for each year, create a Year field, and enter the value of each
record's year in that field.
The reason adding additional fields is problematic is due to the
impact of structural changes to tables on other parts of the
application. When more fields are added to a table, the objects and code
dependent on the table also need to be updated. For instance, queries
need to grab the extra fields, forms need to display them, reports need
to include them, etc. However, if the data were normalized, the existing
objects would automatically get the new data and calculate and/or
display it correctly. Queries are particularly powerful and allow you to
group on the Year field to show summaries by year no matter what years
are in your table.
Data normalization does not mean, however, that you can't display or
use data with time sensitive/dependent fields. Those who need to show
and display such information can often do so with Crosstab queries. If
you are not already, familiarize yourself with crosstab queries and how
to use them. They are not the same as tables (in particular, you cannot
edit the results of a crosstab), but they can certainly be used for
displaying information in a datasheet (up to 255 fields). If you want to
use them in reports, it's more complicated because your report will need
to accommodate the additional or changing field names. That's why most
reports will show data as separate groupings within the report rather
than as separate columns. For those instances where you have no choice,
you'll have to invest the time to support this, but hopefully all
parties will understand the implication such decisions have on
additional resources over time.
So, that's why additional records are free (the big advantage of
databases) and additional fields so expensive. Databases can accommodate
massive amounts of change….if designed properly.
Know When Data Needs to Be Duplicated
Sometimes, data needs to be de-normalized to preserve information
that may change over time.
In our simple example of an invoice linked to the customer table via
a Customer ID number, we may need to preserve the customer address at
the time the invoice is issued (not created since the customer
information may change between the two events). If we did not and we
update the customer information in the future, we may not be able to
confirm the exact shipping address where a particular invoice was sent.
This could be a huge business problem. Of course, some information, like
the customer's phone number, may not need to be preserved. Therefore,
one should selectively determine what data should be duplicated.
Another example of where data needs to be duplicated is for filling
out the line items of an invoice. Often, there's a price list used to
pick the items the customer ordered. One could simply store the
PriceListID to point to the price list with its product description,
price, etc. However, product descriptions and prices change over time.
Without copying the data from the price list into the Line Items table,
you cannot accurately reprint the original invoice in the future, which
can be a big problem if you haven't been paid yet.
So while normalization works great at keeping the same data in one
place and simplifies editing, there are situations where such benefits
are not desired. If you need a snapshot of your data for historic
reasons, it's critical you design it into your database at the
beginning. Otherwise, once the data is overwritten, you can't get it
back.
Use a Meaningless Field for the Key Field
For efficiency, each table should have a key field. The key field
defines uniqueness in the table and is used by indexes on its other
fields to improve performance of searches. For instance, the customer
table could have a CustomerID field that defines a unique number for
each customer. For the purposes of this discussion, we are considering
tables that have multiple fields and not a simple single table lookup,
such as a list of countries.
In general, a key field should have these characteristics:
-
Should be One Field It is possible to define multiple
fields as the key fields of a table, but a single field is
preferable. First, if multiple fields are necessary to define
uniqueness, it takes up more space to store the key than a single
numeric field. Second, additional indexes on the table also have to
use the combination of the key fields which takes up more space than
if it were a single field. Finally, identifying records in the table
requires grabbing a combination of fields. Far better to have a
CustomerID number than a combination of other fields to define a
customer.
-
Should be Numeric Access offers an AutoNumber field
that is a Long Integer which is ideal for key fields. These values
are automatically unique for each record and supports multi-user
data entry as well.
-
Should Not Change Over Time A key field should not
change over time. Once identified, like a social security number, it
should never change. A key field that changes makes it very
difficult to use historic data because the links break.
-
Should be Meaningless To ensure a key field doesn't
change over time, it should have no meaning and therefore no reason
to change over time. A meaningless key value is also helpful if
there are situations where the other data is incomplete. For
instance, you can assign a customer number without have someone's
complete address. The remainder of your application can work
perfectly fine, and you can add the information when you get it. If
your table used as part of it's key, the country or some other
identifying field you didn't have, you run the risk of not being
able to use your application.
So, for all the reasons listed above, we recommend using an
AutoNumber field as the key field for most of your tables. With the use
of combo boxes and hidden columns, you can actually bind fields to the
AutoNumber ID field and hide it from the user.
Use Referential Integrity
Once your tables are defined and you understand how they are related
to each other, be sure to add referential integrity to enforce the
relationship. This prevents link fields from being modified incorrectly
and leaving "orphaned" records. The Jet Engine supports
sophisticated referential integrity which allows you to have cascading
updates and deletes. In general, you should not be changing the ID
field. Therefore, cascading updates are less of an issue, but cascading
deletes can be very helpful.
For instance, if you have an Invoice table related to an Orders table
where one invoice can have an unlimited number of orders (line items)
with each order record containing the invoice number it is linked to,
cascading deletes allow you to delete the invoice record and
automatically delete all its corresponding Order records. That ensures
you never have an Order record without a corresponding Invoice record.
Conclusion
We hope you'll be able to apply these database design concepts early
in your application design to minimize the many problems and remedies
required when such designs are not implemented. Good luck.
Additional Resources
If you need to normalize a lot of un-normalized data, check out our
Total Access
Statistics program which can easily transform data into the
structure you need.
About the Author
Author: Luke Chung,
President and Founder of FMS, Inc. FMS, Inc. (www.fmsinc.com)
is the leading provider of 3rd party products for Microsoft Access users
and developers. Visit the FMS web site for additional Access resources,
utilities, technical papers, demos, and product information.
|