Parse usable Street Address, City, State, Zip from a string [closed]

I've done a lot of work on this kind of parsing. Because there are errors you won't get 100% accuracy, but there are a few things you can do to get most of the way there, and then do a visual BS test. Here's the general way to go about it. It's not code, because it's pretty academic to write it, there's no weirdness, just lots of string handling.

(Now that you've posted some sample data, I've made some minor changes)

  1. Work backward. Start from the zip code, which will be near the end, and in one of two known formats: XXXXX or XXXXX-XXXX. If this doesn't appear, you can assume you're in the city, state portion, below.
  2. The next thing, before the zip, is going to be the state, and it'll be either in a two-letter format, or as words. You know what these will be, too -- there's only 50 of them. Also, you could soundex the words to help compensate for spelling errors.
  3. before that is the city, and it's probably on the same line as the state. You could use a zip-code database to check the city and state based on the zip, or at least use it as a BS detector.
  4. The street address will generally be one or two lines. The second line will generally be the suite number if there is one, but it could also be a PO box.
  5. It's going to be near-impossible to detect a name on the first or second line, though if it's not prefixed with a number (or if it's prefixed with an "attn:" or "attention to:" it could give you a hint as to whether it's a name or an address line.

I hope this helps somewhat.


I think outsourcing the problem is the best bet: send it to the Google (or Yahoo) geocoder. The geocoder returns not only the lat/long (which aren't of interest here), but also a rich parsing of the address, with fields filled in that you didn't send (including ZIP+4 and county).

For example, parsing "1600 Amphitheatre Parkway, Mountain View, CA" yields

{
  "name": "1600 Amphitheatre Parkway, Mountain View, CA, USA",
  "Status": {
    "code": 200,
    "request": "geocode"
  },
  "Placemark": [
    {
      "address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA",
      "AddressDetails": {
        "Country": {
          "CountryNameCode": "US",
          "AdministrativeArea": {
            "AdministrativeAreaName": "CA",
            "SubAdministrativeArea": {
              "SubAdministrativeAreaName": "Santa Clara",
              "Locality": {
                "LocalityName": "Mountain View",
                "Thoroughfare": {
                  "ThoroughfareName": "1600 Amphitheatre Pkwy"
                },
                "PostalCode": {
                  "PostalCodeNumber": "94043"
                }
              }
            }
          }
        },
        "Accuracy": 8
      },
      "Point": {
        "coordinates": [-122.083739, 37.423021, 0]
      }
    }
  ]
}

Now that's parseable!


The original poster has likely long moved on, but I took a stab at porting the Perl Geo::StreetAddress:US module used by geocoder.us to C#, dumped it on CodePlex, and think that people stumbling across this question in the future may find it useful:

US Address Parser

On the project's home page, I try to talk about its (very real) limitations. Since it is not backed by the USPS database of valid street addresses, parsing can be ambiguous and it can't confirm nor deny the validity of a given address. It can just try to pull data out from the string.

It's meant for the case when you need to get a set of data mostly in the right fields, or want to provide a shortcut to data entry (letting users paste an address into a textbox rather than tabbing among multiple fields). It is not meant for verifying the deliverability of an address.

It doesn't attempt to parse out anything above the street line, but one could probably diddle with the regex to get something reasonably close--I'd probably just break it off at the house number.