Is there a standard for storing normalized phone numbers in a database?

What is a good data structure for storing phone numbers in database fields? I'm looking for something that is flexible enough to handle international numbers, and also something that allows the various parts of the number to be queried efficiently.

Edit: Just to clarify the use case here: I currently store numbers in a single varchar field, and I leave them just as the customer entered them. Then, when the number is needed by code, I normalize it. The problem is that if I want to query a few million rows to find matching phone numbers, it involves a function, like

where dbo.f_normalizenum(num1) = dbo.f_normalizenum(num2)

which is terribly inefficient. Also queries that are looking for things like the area code become extremely tricky when it's just a single varchar field.

[Edit]

People have made lots of good suggestions here, thanks! As an update, here is what I'm doing now: I still store numbers exactly as they were entered, in a varchar field, but instead of normalizing things at query time, I have a trigger that does all that work as records are inserted or updated. So I have ints or bigints for any parts that I need to query, and those fields are indexed to make queries run faster.


Solution 1:

First, beyond the country code, there is no real standard. About the best you can do is recognize, by the country code, which nation a particular phone number belongs to and deal with the rest of the number according to that nation's format.

Generally, however, phone equipment and such is standardized so you can almost always break a given phone number into the following components

  • C Country code 1-10 digits (right now 4 or less, but that may change)
  • A Area code (Province/state/region) code 0-10 digits (may actually want a region field and an area field separately, rather than one area code)
  • E Exchange (prefix, or switch) code 0-10 digits
  • L Line number 1-10 digits

With this method you can potentially separate numbers such that you can find, for instance, people that might be close to each other because they have the same country, area, and exchange codes. With cell phones that is no longer something you can count on though.

Further, inside each country there are differing standards. You can always depend on a (AAA) EEE-LLLL in the US, but in another country you may have exchanges in the cities (AAA) EE-LLL, and simply line numbers in the rural areas (AAA) LLLL. You will have to start at the top in a tree of some form, and format them as you have information. For example, country code 0 has a known format for the rest of the number, but for country code 5432 you might need to examine the area code before you understand the rest of the number.

You may also want to handle vanity numbers such as (800) Lucky-Guy, which requires recognizing that, if it's a US number, there's one too many digits (and you may need to full representation for advertising or other purposes) and that in the US the letters map to the numbers differently than in Germany.

You may also want to store the entire number separately as a text field (with internationalization) so you can go back later and re-parse numbers as things change, or as a backup in case someone submits a bad method to parse a particular country's format and loses information.

Solution 2:

KISS - I'm getting tired of many of the US web sites. They have some cleverly written code to validate postal codes and phone numbers. When I type my perfectly valid Norwegian contact info I find that quite often it gets rejected.

Leave it a string, unless you have some specific need for something more advanced.

Solution 3:

The Wikipedia page on E.164 should tell you everything you need to know.

Solution 4:

Here's my proposed structure, I'd appreciate feedback:

The phone database field should be a varchar(42) with the following format:

CountryCode - Number x Extension

So, for example, in the US, we could have:

1-2125551234x1234

This would represent a US number (country code 1) with area-code/number (212) 555 1234 and extension 1234.

Separating out the country code with a dash makes the country code clear to someone who is perusing the data. This is not strictly necessary because country codes are "prefix codes" (you can read them left to right and you will always be able to unambiguously determine the country). But, since country codes have varying lengths (between 1 and 4 characters at the moment) you can't easily tell at a glance the country code unless you use some sort of separator.

I use an "x" to separate the extension because otherwise it really wouldn't be possible (in many cases) to figure out which was the number and which was the extension.

In this way you can store the entire number, including country code and extension, in a single database field, that you can then use to speed up your queries, instead of joining on a user-defined function as you have been painfully doing so far.

Why did I pick a varchar(42)? Well, first off, international phone numbers will be of varied lengths, hence the "var". I am storing a dash and an "x", so that explains the "char", and anyway, you won't be doing integer arithmetic on the phone numbers (I guess) so it makes little sense to try to use a numeric type. As for the length of 42, I used the maximum possible length of all the fields added up, based on Adam Davis' answer, and added 2 for the dash and the 'x".