The information in this article applies to:
Note: Requires knowledge of creating tables.
One of the most important steps in designing a database is ensuring that the data is properly distributed among its tables. With proper data structures, the remainder of the application (the queries, forms, reports, code, etc.) is significantly simplified. Making sure this foundation is properly designed is critical to the creation of an application and the ability to scale and enhance it over time.
The formal name for proper table design is database normalization. This article is an overview of the basic database normalization concepts and some common pitfalls to consider and avoid.
Before proceeding with table design, it's important to understand what you're planning to do with your data and how it will change over time. The assumptions you make impact the eventual design.
When designing an application, it's critical to understand the final results to ensure that you have all the necessary data and know where it comes from. For instance, what is the appearance of the reports, where does each piece of data come from, and do they all exist? Nothing is more damaging to a project than the realization, late in the process, that data is missing for an important report.
Once you know what data you need, you must determine where it comes from. Is the data imported from another source? Does that data need to be cleaned or verified? Does the user enter data?
Having a firm grasp of what data is required and where it comes from is the first step in database design.
Will your users need to edit the data, and if so, how should the data be displayed for them to understand and edit? Are there validation rules and related lookup tables? Are there auditing issues associated with data entry that require keeping backups of edits and deletes? What kind of summary information needs to be displayed to the user? Do you need to generate export files? With this information, you can envision how the fields are related to each other.
Group your data into related fields (e.g. customer related information, invoice related information, etc.). Each group of fields represents future tables. You should then consider how they are related to each other. For instance, what tables are related in a one-to-many relationship (e.g. one customer may have multiple invoices) and what's a one-to-one relationship (often a consideration to combine into one table).
After the tables are designed, the impact of time is often not considered and causes huge problems later. Many table designs work perfectly fine for immediate use. However, many designs breakdown as users modify the data, new data gets added, and time passes. Often, developers find they need to restructure their tables to accommodate these changes. When table structures change, all their dependencies (queries, forms, reports, code, etc.) also need to be updated. By understanding and anticipating change over time, a better design can be implemented to minimize the problems.
Understanding how you are going to analyze and manipulate the data is also important. You should have a firm grasp of how queries work, how to use them to link data across multiple tables, how to use them to group and summarize data, and the use of crosstab queries when you need to display data in non-normalized format.
Ultimately, the goal of good data design is to balance the needs of storing the data efficiently over time versus easily retrieving and analyzing it. Understanding the power of queries significantly helps with properly designing your tables.
Rather than a theoretical discussion about database normalization, here are some basic concepts for database normalization. How you apply them in your situation may differ based on the needs of your application. Hopefully, you'll understand these basic concepts, apply them when you can, and understand the issues when you need to deviate from them.
Most database developers understand this basic concept of data normalization. Ideally, you'd like to store the same data in one place and refer to it with an ID when you need to reference it. Therefore, if some information changes, you can change it in one place and the information "ripples" through your application.
For instance, a customer table would store a record for each customer with their name, address, phone numbers, email, and other characteristics. The customer table would have a unique Customer ID field (usually an autonumber field) that is its key field and used by other tables to refer to the customer. Therefore, an invoice table, rather than storing all the customer information with each invoice (because the same customer may have multiple invoices), would simply refer to the Customer ID value which could be used to lookup the customer details in the Customer table. Access makes it very easy to do this through its powerful forms using combo boxes and subforms. If you need to make a change to the customer's information (say their phone number changed), you can change it in the customer table and know that any other part of your application which references that information is automatically updated.
With a properly normalized database, changes to data over time are easily handled with a simple edit. Improperly normalized databases often include programming or queries to make changes across multiple records or tables. This not only requires more work to implement, it opens the possibility that the data becomes inconsistent if the code or queries don't execute or if there's a bug.
Databases should be designed so that over time, you simply add new records. Database tables are designed to hold huge numbers of records. However, if you find you need to add more fields, you probably have a design problem.
This often happens with spreadsheet experts who design databases the way they are accustomed to designing spreadsheets. Time sensitive fields are designed (e.g. Year, Quarter, Product, Salesman, etc.) which require new fields to be added in the future. But the correct design is to transpose the information and have the time sensitive data in one field so more records can be added. For instance, rather than a separate field for each year, create a Year field, and enter the value of each record's year in that field.
The reason adding additional fields is problematic is due to the impact of structural changes to tables on other parts of the application. When more fields are added to a table, the objects and code dependent on the table also need to be updated. For instance, queries need to grab the extra fields, forms need to display them, reports need to include them, etc. However, if the data were normalized, the existing objects would automatically get the new data and calculate and/or display it correctly. Queries are particularly powerful and allow you to group on the Year field to show summaries by year no matter what years are in your table.
Data normalization does not mean, however, that you can't display or use data with time sensitive/dependent fields. Those who need to show and display such information can often do so with Crosstab queries. If you are not already, familiarize yourself with crosstab queries and how to use them. They are not the same as tables (in particular, you cannot edit the results of a crosstab), but they can certainly be used for displaying information in a datasheet (up to 255 fields). If you want to use them in reports, it's more complicated because your report will need to accommodate the additional or changing field names. That's why most reports will show data as separate groupings within the report rather than as separate columns. For those instances where you have no choice, you'll have to invest the time to support this, but hopefully all parties will understand the implication such decisions have on additional resources over time.
So, that's why additional records are free (the big advantage of databases) and additional fields so expensive. Databases can accommodate massive amounts of change….if designed properly.
Sometimes, data needs to be de-normalized to preserve information that may change over time.
In our simple example of an invoice linked to the customer table via a Customer ID number, we may need to preserve the customer address at the time the invoice is issued (not created since the customer information may change between the two events). If we did not and we update the customer information in the future, we may not be able to confirm the exact shipping address where a particular invoice was sent. This could be a huge business problem. Of course, some information, like the customer's phone number, may not need to be preserved. Therefore, one should selectively determine what data should be duplicated.
Another example of where data needs to be duplicated is for filling out the line items of an invoice. Often, there's a price list used to pick the items the customer ordered. One could simply store the PriceListID to point to the price list with its product description, price, etc. However, product descriptions and prices change over time. Without copying the data from the price list into the Line Items table, you cannot accurately reprint the original invoice in the future, which can be a big problem if you haven't been paid yet.
So while normalization works great at keeping the same data in one place and simplifies editing, there are situations where such benefits are not desired. If you need a snapshot of your data for historic reasons, it's critical you design it into your database at the beginning. Otherwise, once the data is overwritten, you can't get it back.
For efficiency, each table should have a key field. The key field defines uniqueness in the table and is used by indexes on its other fields to improve performance of searches. For instance, the customer table could have a CustomerID field that defines a unique number for each customer. For the purposes of this discussion, we are considering tables that have multiple fields and not a simple single table lookup, such as a list of countries.
In general, a key field should have these characteristics:
Should be One Field
It is possible to define multiple fields as the key fields of a table, but a single field is preferable. First, if multiple fields are necessary to define uniqueness, it takes up more space to store the key than a single numeric field. Second, additional indexes on the table also have to use the combination of the key fields which takes up more space than if it were a single field. Finally, identifying records in the table requires grabbing a combination of fields. Far better to have a CustomerID number than a combination of other fields to define a customer.
Should be Numeric
Access offers an AutoNumber field that is a Long Integer which is ideal for key fields. These values are automatically unique for each record and supports multi-user data entry as well.
Should Not Change Over Time
A key field should not change over time. Once identified, like a social security number, it should never change. A key field that changes makes it very difficult to use historic data because the links break.
Should be Meaningless
To ensure a key field doesn't change over time, it should have no meaning and therefore no reason to change over time. A meaningless key value is also helpful if there are situations where the other data is incomplete. For instance, you can assign a customer number without have someone's complete address. The remainder of your application can work perfectly fine, and you can add the information when you get it. If your table used as part of it's key, the country or some other identifying field you didn't have, you run the risk of not being able to use your application.
So, for all the reasons listed above, we recommend using an AutoNumber field as the key field for most of your tables. With the use of combo boxes and hidden columns, you can actually bind fields to the AutoNumber ID field and hide it from the user.
Once your tables are defined and you understand how they are related to each other, be sure to add referential integrity to enforce the relationship. This prevents link fields from being modified incorrectly and leaving "orphaned" records. The Jet Engine supports sophisticated referential integrity which allows you to have cascading updates and deletes. In general, you should not be changing the ID field. Therefore, cascading updates are less of an issue, but cascading deletes can be very helpful.
For instance, if you have an Invoice table related to an Orders table where one invoice can have an unlimited number of orders (line items) with each order record containing the invoice number it is linked to, cascading deletes allow you to delete the invoice record and automatically delete all its corresponding Order records. That ensures you never have an Order record without a corresponding Invoice record.
We hope you'll be able to apply these database design concepts early in your application design to minimize the many problems and remedies required when such designs are not implemented. Good luck.
For more information, check out our other resources: