What is combining repeating sets of row information into new entities called when doing database normalization?
I'm a bit confused about a certain piece of database normalization and thought I'd ask StackOverflow:
Imagine you have the following relations that relate products to colors. Notice that Product 1 and Product 2 both use the same set of colors (Blue and Green).
Product_Color Color
+-------------+-------------+ +-------------+-------------+
| Product* | Color* | | ColorId* | Name |
+-------------+-------------+ +-------------+-------------+
| 1 | 1 | | 1 | Blue |
| 1 | 2 | | 2 | Green |
| 2 | 1 | +-------------+-------------+
| 2 | 2 |
+-------------+-------------+
If I create two new relations, ColorSet and ColorSet_Color, I can display the same information by joining the 4 relations together.
Product_ColorSet: ColorSet_Color:
+-------------+-------------+ +-------------+-------------+
| Product* | ColorSetId* | | ColorSetId* | ColorId* |
+---------------------------+ +-------------+-------------+
| 1 | 1 | | 1 | 1 |
| 2 | 1 | | 1 | 2 |
+-------------+-------------+ +---------- --+-------------+
ColorSet: Color:
+-------------+ +-------------+-------------+
| ColorSetId* | | ColorId* | Name |
+-------------+ +-------------+-------------+
| 1 | | 1 | Blue |
| 2 | | 2 | Green |
+-------------+ +----------[--+-------------+
At this point if I had a large Product_Color table, with a reasonable degree of shared groups of colors, I would stand to gain considerably from a space perspective.
What is the technical name for this operation in the context of database normalization? I'm clearly removing redundant information even though the entity I've created doesn't actually exist, it's rather more just random chance that there is a lot of overlap. What specifically am I changing by doing this?
Furthermore, it seems like I could arbitrarily do this to most entities. What puzzles me is that Product_Color and Color are already in 6th normal form when we started the exercise (right?).
Solution 1:
You are introducing a "surrogate key" (or identifier) to name/identify sets of colours that products come in. The alternative is usually considered to be a "natural key" (or identifier). (Although different people use these terms differently in detail. Eg some might only use "surrogate" when a name/identifier is assigned a referent permanently and/or is its referent's only name/identifier and/or it is visible only withing the database & not the application. Eg some would say that an externally visible system-generated arbitrary name/identifier like a Driver Identification Number is both a surrogate and natural.)
Surrogate keys are often called "meaningless (identifiers)". This reflects muddled thinking. All names not generated by an a priori naming scheme are "meaningless" & arbitrary. "Nicholas" did not "mean" you until it was chosen; having been chosen, it "means" you. This goes for any name/identifier. So "meaningless"/"meaningful" is not a helpful distinction. A surrogate name/identifier in a system is just one that got chosen after the system started. What gets called "meaningful" [sic] in a system would have been called "meaningless" [sic] when assigned in whatever system existed before (since assignment was after it started).
There is a "perspective" in which you are "removing redundant information", but it's not the kind of redundancy that normalization addresses. You are replacing a table by other tables, but it's not normalization decomposition. Introduction of surrogates is not part of normalization. Normalization does not introduce new column names. It just reuses an original table's names in the tables that replace it. (Are you able to clearly and exactly describe just what you mean by "redundant" here?)
Sometimes people think that if the same subtuple of values can appear more than once in a column set or table then those subrow values need to be replaced by ids that are FKs to a new table that maps id values to subrow values. (Maybe even for single-column subrows, ie when a single value appears more than once in a column or table.) They think that multiple subrow value appearances are "redundant" or that only ids can repeat without being "redundant". (The id design is seen as a kind of data compression of the original.) They may think that this is part of normalization. None of this is so.
This is not redundancy that you should bother to address via table design. If you know the implementation options of your tables by the DBMS and you know the usage patterns of your application and you know that the original is demonstrably and meaningfully worse than some option that happens to be "less redundant" (and why wouldn't a "more redundant" option be better?) then you should tell the DBMS what option you want for your design without changing the schema if you can. (This is typically done via indexes and/or views.) Eg indexing your original Product_Color on ColorId leads to essentially the same structure in the implementation as you have created by hand in your second design, but automatically generated and managed. (You might introduce surrogates for other reasons, eg to replace multiple-column foreign keys by more concise although more obscurely valued and constrained ones.)
Re options: Your new design will use more operations (eg joins and projections) in query text and (for typical DBMS implementations) execution than the original (eg to query for the original table) but fewer elsewhere (eg in copying one product's colour set to another's). So again it is all about tradeoffs of multiple "perspectives".
In fact you have in another sense introduced redundancy with the surrogates. There are additional columns holding a bunch of id values that are not in the original yet that record the same situations. You have also burdened the user with a design with more naming and indirections. The surrogate design certainly has a lot of "redundant information" in this "perspective" compared to the original.
Even your starting design has probably introduced surrogates, namely colour ids of colour names. (If colour ids added "information", ie "informed" you beyond just their associated names, then they would not be surrogates and would be necessary.) Ie if colour ids are chosen arbitrarily then you could just have:
Product_Color
+-------------+-------------+
| Product* | ColorName* |
+-------------+-------------+
| 1 | Blue |
| 1 | Green |
| 2 | Blue |
| 2 | Green |
+-------------+-------------+
You should have a reason to introduce colour ids, and for that matter product ids, rather than natural keys already existing. Can you justify your multiple tables, names and indirections vs just one?