I'm new to database design and I have been reading quite a bit about normalization. If I had three tables: Accommodation, Train Stations and Airports. Would I have address columns in each table or an address table that is referenced by the other tables? Is there such a thing as over-normalization?

Thanks


Solution 1:

Database Normalization is all about constructing relations (tables) that maintain certain functional dependencies among the facts (columns) within the relation (table) and among the various relations (tables) making up the schema (database). Bit of a mouth-full, but that is what it is all about.

A Simple Guide to Five Normal Forms in Relational Database Theory is the classic reference for normal forms. This paper defines in simple terms what the essence of each normal form is and its significance with respect to database table design. This is a very good "touch-stone" reference.

To answer your specific question properly requires additional information. Some critical questions you have to ask are:

  • Is an Address a simple fact (e.g. blob of text) or a composite fact (e.g. composed of multiple attributes: Address line, City Name, Postal Code etc.)
  • What are the other "facts" relating to "Accommodation", "Airport" and "Train Station"?
  • What sets of "facts" uniquely and minimally identify an "Airport", an "Accommodation" and a "Train Station" (these facts are typically called a key or candidate key)?
  • What functional dependencies exist among Address facts and the facts composing each relations key?

All this to say, the answer to your question is not as straight forward as one might hope for!

Is there such a thing as "over normalization"? Maybe. This depends on whether the functional dependencies you have identified and used to build your tables are of significance to your application domain.

For example, suppose it was determined that an address was composed of multiple attributes; one of which is postal code. Technically a postal code is a composite item too (at least Canadian Postal Codes are). Further normalizing your database to recognize these facts would probably be an over-normalization. This is because the components of a postal code are irrelevant to your application and therefore factoring them into the database design would be an over-normalization.

Solution 2:

For addresses, I would almost always create a separate address table. Not only for normalization but also for consistency in fields stored.

As for such a thing as over-normalization, absolutely there is! It's hard to give you guidance on what is and isn't over-normalization as I think it mostly comes from experience. However, follow the books on each level of normalization and then once it starts to get difficult to see where things are you've probably gone too far.

Look at all the sample/example databases you can as well. They will give you a good indication on when you should be splitting out data and when you shouldn't.

Also, be well aware of the type and amount of data you're storing, along with the speed of access, etc. A lot of modern web software is going fully de-normalized for many performance and scalability reason. It's worth looking into those for reason why and when you should and shouldn't de-normalize.