What to do with null values when modeling and normalizing?
I'm new in SQL(still learning) and I have to create a database for a venue. A client book for a room for an event. The problem is that the clients don't always provide their name, their email, and their phone number. Most of the time it's either name and email or name and phone. It's rarely all 3 but it happens. I need to store each of these in their respective attribute (name, email, phone). But the way they give me their info, I have a lot of null values. What can I do with these nulls? I've been told that it's better to not have nulls. I also need to normalize my table after that. Any suggestions please.
SQL treats NULL specially per its version of 3VL (3-valued logic). Normalization & other relational theory does not. However, we can translate SQL designs into relational designs and back. (Assume no duplicate rows here.)
Normalization happens to relations and is defined in terms of operators that don't treat NULL specially. The term "normalization" has two most common distinct meanings: putting a table into "1NF" and into "higher NFs (normal forms)". NULL doesn't affect "normalization to 1NF". "Normalization to higher NFs" replaces a table by smaller tables that natural join back to it. For purposes of normalization you could treat NULL like a value that is allowed in the domain of a nullable column in addition to the values of its SQL type. If our SQL tables have no NULLs then we can interpret them as relations & SQL join etc as join, etc. But if you decompose where a nullable column was shared between components then realize that to reconstruct the original in SQL you have to SQL join on same-named columns being equal or both NULL. And you won't want such CKs (candidate keys) in an SQL database. Eg you can't declare it as an SQL PK (primary key) because that means UNIQUE NOT NULL. Eg a UNIQUE constraint involving a nullable column allows multiple rows that have a NULL in that column, even if the rows have the same values in every column. Eg NULLs in SQL FKs cause them to be satisfied (in various ways per MATCH mode), not to fail from not appearing in the referenced table. (But DBMSs idiosyncratically differ from standard SQL.)
Unfortunately decomposition might lead to a table with all CKs containing NULL, so that we have nothing to declare as SQL PK or UNIQUE NOT NULL. The only sure solution is to convert to a NULL-free design. After then normalizing we might want to reintroduce some nullability in the components.
In practice, we manage to design tables so that there is always a set of NULL-free columns that we can declare as CK, via SQL PK or UNIQUE NOT NULL. Then we can get rid of a nullable column by dropping it from the table and adding a table with that column and the columns of some NULL-free CK: If the column is non-NULL for a row in the old design then a row with its CK subrow and column value go in the added table; otherwise it is NULL in the old design and no corresponding row is in the added table. (The original table is a natural left join of the new ones.) Of course, we also have to modify queries from the old design to the new design.
We can always avoid NULLs via a design that adds a boolean column for each old nullable column and has the old column NOT NULL. The new column says for a row whether the old column was NULL in the old design and when true has the old column be some one value that we pick for that purpose for that type throughout the database. Of course, we also have to modify queries from the old design to the new design.
Whether you want to avoid NULL is a separate question. Your database might in some way be "better" or "worse" for your application with either design. The idea behind avoiding NULL is that it complicates the meanings of queries, hence complicates querying, in a perverse way, compared to the complication of more joins from more NULL-free tables. (That perversity is typically managed by removing NULLs in query expressions as close to where they appear as possible.)
PS Many SQL terms including PK & FK differ from the relational terms. SQL PK means something more like superkey; SQL FK means something more like foreign superkey; but it doesn't even make sense to talk about a "superkey" in SQL:
Because of the resemblance of SQL tables to relations, terms that involve relations get sloppily applied to tables. But although you can borrow terms and give them SQL meanings--value, table, FD (functional dependency), superkey, CK (candidate key), PK (primary key), FK (foreign key), join, and, predicate, NF (normal form), normalize, 1NF, etc--you can't just substitute those SQL meanings for those words in RM definitions, theorems or algorithms and get something sensible or true. Moreover SQL presentations of RM notions almost never actually tell you how to soundly apply RM notions to an SQL database. They just parrot RM presentations, oblivious to whether their use of SQL meanings for terms makes things nonsensical or invalid.
First of all there is nothing wrong with nulls in a database. And they are made exactly for this purpose where attributes are unknown. To avoid nulls in a database is an advice that makes little sense in my opinion.
So you'd have three (or four) values - name (first/last), email address, and phone number - identifying a client. You can have them in a table and add a constraint to it assuring that always at least one of these columns is filled, e.g. coalesce(name, email, phone) is not null
. This makes sure a booking cannot be done completely anonymously.
From your explanation it is not clear whether you will always have the same information from a client. So can it happen that a client books a room giving their name and later they book another room giving their phone instead? Or will the client be looked up in the database, their name found and the two booking assigned to them? In the latter case you can have a clients table holding all information you got so far, and the booking will contain the client record ID as a reference to this data. In the former case you may not want to have a clients table, because you cannot identify whether two clients (Jane Miller and [email protected]) are really two different clients or only one client actually.
The tables I see so far:
- room (room_id, ...)
- venue (venue_id, ...)
- client (client_id, name, email, phone)
- booking (venue_id, room_id, client_id, ...)