Find phone numbers in python script

If you are interested in learning regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal let you enter some test data, then write and test a regular expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc.), grab a regex cheat sheet, and see how far you can get. If nothing else, it will help you read other people's expressions.

Edit: Here is a modified version of your regex, which should also match 7- and 10-digit phone numbers that lack any hyphens, spaces, or dots. I added question marks after the character classes (the []s), which makes the separator each one matches optional (the doubled ?? is the lazy form of ?). I tested it in RegexPal, but as I'm still learning regex, I'm not sure that it's perfect. Give it a try.

(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})

It matched the following values in RegexPal:

000-000-0000
000 000 0000
000.000.0000

(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000

000-0000
000 0000
000.0000

0000000
0000000000
(000)0000000
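
If you would rather check this in Python than in RegexPal, here is a minimal sketch. It assumes each sample is tested on its own with `fullmatch`; the names `PHONE_RE`, `SAMPLES`, and `fully_matched` are mine, not part of the original question.

```python
import re

# The pattern from above, written as a raw string so the backslashes
# reach the regex engine unchanged.
PHONE_RE = re.compile(
    r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}"
    r"|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}"
    r"|\d{3}[-\.\s]??\d{4})"
)

# The sample values listed above.
SAMPLES = [
    "000-000-0000", "000 000 0000", "000.000.0000",
    "(000)000-0000", "(000)000 0000", "(000)000.0000",
    "(000) 000-0000", "(000) 000 0000", "(000) 000.0000",
    "000-0000", "000 0000", "000.0000",
    "0000000", "0000000000", "(000)0000000",
]


def fully_matched(samples):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if PHONE_RE.fullmatch(s)]


for sample in SAMPLES:
    print(sample, "->", bool(PHONE_RE.fullmatch(sample)))
```

Using `fullmatch` here is a deliberate choice: it only reports a hit when the whole sample is consumed, so a 7-digit match inside a longer string does not count as success.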

This is the process of building a phone number scraping regex.

First, we need to match an area code (3 digits), a trunk (3 digits), and an extension (4 digits):

import re
reg = re.compile(r"\d{3}\d{3}\d{4}")

Now, we want to capture the matched phone number, so we add parentheses around the parts that we're interested in capturing (all of it):

reg = re.compile(r"(\d{3}\d{3}\d{4})")

The area code, trunk, and extension might be separated by up to 3 characters that are not digits (as when spaces are used along with the hyphen or dot delimiter):

reg = re.compile(r"(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")
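
To see what `\D{0,3}` buys you, here is a small sketch run against made-up sample text (the numbers are fictional). Note that `\D` means any non-digit, so letters between the digit groups are accepted too.

```python
import re

# Up to three non-digit characters may sit between the digit groups.
reg = re.compile(r"(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")

text = "Call 555 - 867 - 5309 or 5558675309 today."
print(reg.findall(text))

# \D is permissive: letters count as separators as well.
print(reg.findall("555abc867xyz5309"))
```

That permissiveness is fine for scraping messy text, but it is worth knowing about if false positives matter to you.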

Now, the phone number might actually start with a ( character (if the area code is enclosed in parentheses):

reg = re.compile(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")

Now, that whole phone number is likely embedded in a bunch of other text:

reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")

Now, that other text might include newlines:

reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
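
Here is a rough sketch of how the finished pattern behaves on multi-line input. The sample text and the `no_dotall` name are made up for illustration. The `re.S` flag matters most with `re.match`, which is anchored at the start of the string and so needs `.*?` to cross newlines.

```python
import re

reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)

# Same pattern without re.S, for comparison only.
no_dotall = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")

text = "Contact details\nOffice: (555) 123-4567\nFax: 555.987.6543\n"

# match() starts at position 0; with re.S the leading .*? can skip
# past the first line to reach the number on the second line.
print(reg.match(text).group(1))

# Without re.S the leading .*? stops at the first newline, so match() fails.
print(no_dotall.match(text))

# findall() returns the captured group for every number in the text.
print(reg.findall(text))
```

If you only ever call `findall` or `search`, the leading and trailing `.*?` are not strictly needed, since those functions scan the text on their own.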

Enjoy!

I personally stop here, but if you really want to be sure that only spaces, hyphens, and dots are used as delimiters then you could try the following (untested):

reg = re.compile(r".*?(\(?\d{3}\)? ?[\.-]? ?\d{3} ?[\.-]? ?\d{4}).*?", re.S)
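
A minimal sketch of the stricter variant, under two assumptions of mine: the optional closing bracket after the area code is written `\)?`, and the `.*?` wrappers are dropped because `findall` scans the text by itself. `STRICT_PHONE_RE` and the sample text are hypothetical.

```python
import re

# Only spaces, hyphens, and dots are allowed between the digit groups.
STRICT_PHONE_RE = re.compile(r"(\(?\d{3}\)? ?[.-]? ?\d{3} ?[.-]? ?\d{4})")

text = (
    "A: (555) 123-4567, B: 555.123.4567, "
    "C: 555 - 123 - 4567, D: 555abc123xyz4567"
)

# D is skipped because letters are not accepted as delimiters here.
print(STRICT_PHONE_RE.findall(text))
```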

I think this is a very simple regex for parsing phone numbers:

re.findall(r"[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)
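
For illustration, here is that call on some made-up input. The value of `lines` is my assumption; the original does not show it. The pattern only accepts a bracketed area code followed by a hyphenated number, so the third entry is ignored.

```python
import re

# Hypothetical input; in practice this would be the text you read in.
lines = "Home: (555) 123-4567\nWork: (555)765-4321\nCell: 555-000-1111\n"

found = re.findall(r"[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)
print(found)
```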

For Spanish phone numbers, I use this with considerable success:

re.findall(r'[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}', text)
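
A small sketch of how this could be used, with a made-up sentence and fictional numbers. Because each `.` matches any character, a raw match can pick up a trailing punctuation mark, so the hypothetical helper `find_spanish_numbers` strips everything that is not a digit before returning.

```python
import re

SPANISH_RE = re.compile(r"[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}")


def find_spanish_numbers(text):
    """Return matched numbers as bare digit strings."""
    # re.sub(r"\D", "", ...) removes separators and stray punctuation.
    return [re.sub(r"\D", "", m) for m in SPANISH_RE.findall(text)]


text = "Llama al 91 123 45 67 o al 612 345 678."
print(find_spanish_numbers(text))
```

Normalising to digits is only one option; if you need the original formatting, keep the raw matches and trim the ends instead.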
