How should international geographical addresses be stored in a relational database?
Given the task of storing international geographic addresses in a relational table, what is the most flexible schema? Should every part of the address be broken out into their own fields, or should it be more like free text?
Is there any sense in separating differently formatted address into different tables? For example, have a table for USAAddress, CanadianAddress, UKAddress...?
Solution 1:
I will summarize my thoughts from my blog post - A lesson in address storage (on archive.org).
On my current project [I work for a logistics company] we're storing international addresses. I've done research on addresses all over the world in the design of this portion of the database. There's a lot of different formats. In the Western world we tend to use a fairly uniform format - a few differences but they're mostly:
- Street Number - Numeric
- House or Building Name - [VarChar - in the UK some houses/buildings are identified by name, not by number]
-
Street Number Suffix [VarChar, although in most cases, Char(1) would suffice]
- A, B etc
- Street Name [VarChar]
-
Street Type [VarChar or Int if you have a StreetTypes table]
- So far, I've found 262 unique types in the English speaking world, there are likely more, and don't forget other languages i.e. Strasse, Rue etc.
-
Street Direction [VarChar(2)]
- N, E, S, W, NE, SE, NW, SW
-
Address Type [VarChar or Int if you have an AddressTypes table]
- PO Box
- Apartment
- Building
- Floor
- Office
- Suite
- etc...
-
Address Type Identifier [VarChar]
- i.e. Box Number, Apartment Number, Floor Number remember apartment numbers and offices sometimes have alphanumeric info - like 1A
-
Local Municipality [VarChar or Int if you have a Municipalities table]
- For instance, if your hamlet/village appears in the address before the town.
- City/Town [VarChar or Int if you have a Cities table]
-
Governing District [VarChar or Int if you have a Districts table]
- State (U.S.)
- Province (Canada)
- Federal District (Mexico)
- County (U.K.)
- etc...
-
Postal Area [VarChar]
- Zip (U.S.)
- Postal Code (Canada, Mexico)
- Postcode (U.K.)
- Country [VarChar or Int if you have a Countries table]
This appears to cover most countries but the ordering of the fields may be displayed differently. You can find a list of display formats at http://www.bitboost.com/ref/international-address-formats.html#Formats
For instance, in many countries, the postal code falls before the city name and the street number falls after the street name. In Canada, U.S. and the U.K. the street number precedes the street name and the postal code (or ZIP) comes after the city name.
In answer to your question about separation of the addresses into different countries, I wouldn't suggest it, it will just make life harder in other areas - for instance reporting. The format I've provided covers all the addresses in our logistics database which covers USA, Canada, Mexico and the UK without any problems. It also covers all of our European, Chinese, Japanese and Malaysian addresses. I can't speak for other countries but I haven't yet had to store an address from a country that these fields won't support.
I don't suggest going with the Address1, Address2, Address3 format suggested by others and seen in many databases because parsing address information out of an alphanumeric string isn't as simple as it might first seem - especially if data isn't entered correctly, due to misinformation, typo, misspelling etc. If you separate your fields you can use distance algorithms to check for likely meaning, use probability to check street name against postal code and street number or to check province and city against street name etc. Try doing any of that when you've got a string denoting your whole street address. It's not a trivial matter by any stretch of the imagination.
QA on an address database is a headache, period. The easiest way to simplify your life in this area is to make sure all the fields hold only a single piece of information that can be automatically verified as correct at entry time. Probability, distance algorithms and regular expressions can check for validity of entry and provide feedback to the user as to what their mistake was and suggest suitable corrections.
One caveat to be aware of is roads with names that are also street types - if you're covering Canada you need to be aware of "Avenue Road" in Toronto which will trip you up big time if you're using the Address1, 2, 3 format. This likely occurs in other places too, although I'm not aware of them - this single instance was enough for me to scream WTF?!
Solution 2:
Be careful not to over-analyze address formats. When you do, you're quite likely to end up with a specification most users will need to work around, effectively forcing them to use the wrong fields, or only filling the primary fields and ignoring the extra fields.
Keep things simple.
A StreetType like mentioned by BenAlabaster will cause problems when you start working with languages different from isolating languages like English or Spanish.
To show you how bad things can get in the wild: the "Henriette Roland Holststraat" in Amsterdam, built up from "Henriette" + "Roland Holst" + "straat", which can be abbreviated as the "Roland Holststraat", or "Roland Holststr.", or misspelled as "H.R.Holststr." or "Henriette Roland-Holst straat", depending on the weather. Unless you've got an up-to-date street register for each country on earth, you'll be going nowhere.
And finally, be careful that in some multilingual countries, names can be different from one language to another! For instance in Brussels where many streets have both a French and a Dutch name: "Avenu du Port" and "Havenlaan", depending on the addressee's preferred language. (Google Maps shows both names alternately, just to be on the safe side.)
You can try to devise all kinds of clever tricks here, but are the sales reps. going to understand this?