Extracting email from text string in excel/google spreadsheet
I'm looking to extract email addresses from text strings in Google Docs/Excel and email them automatically, where the content of the cells are currently unknown (until updated by user).
For example, they might read:
Hi,
My friend [email protected].
Thanks,
Example Joe
OR
You should ask
I've currently got this formula:
=IFERROR((LEFT(CELL,FIND("@",CELL)-1))&"@"&(regexextract(CELL,"@(.*)")))
It works for almost all cases, except
a) as in example 1, where someone puts a , or . at the end of the email
b) as in example 2, where the email starts on a new line, it comes in as eg.
ask
How might I adapt the formula to correct these?
Solution 1:
For Google Spreadsheets
Google Spreadsheets has these cool already-builtin regex formulas
- REGEXEXTRACT, REGEXMATCH, REGEXREPLACE
We use the first one to extract mail adresses. Put this formula in a Google Spreadsheet cell:
=iferror(Regexextract(A1;"[A-z0-9._%+-]+@[A-z0-9.-]+\.[A-z]{2,4}");"")
-
A1
is the cell where the string to check (mail body in your case) should reside - Complete formula consist of 2 parts. The inner part is the regex formula and the outer part is for error prevention
-
Regexextract(A1,"\[A-z0-9._%+-\]+@\[A-z0-9.-\]+\.\[A-z\]{2,4}")
returns the mail address -
iferror(innerformula,"")
prevents#N/A
when the regex formula wasn't able to return anything e.g no valid mail address was found
How does the regex pattern work?
[A-z0-9._%+-]+@[A-z0-9.-]+.[A-z]{2,4}
-
A-z
represents any character between anA
and az
.
Note the uppercase and lowercase. This way it's case-insensitive -
0-9
represents any digit -
._%+-
represent those signs itself -
[ ]
represents a single character which is allowed inside the brackets - Putting a
+
sign behind[ ]
allows the previous pattern to be repeated infinitely -
@
has no special meaning. It literally searches for a@
sign -
[A-z0-9.-]+
is the same as above. But_%+
are not allowed behind a@
sign this time -
\.
searches for a single point. It has to be escaped with a preceeding\
because.
normally is a placeholder any character - At last
[A-z]{2,4}
searches for 2,3 or 4 case-insensitive characters
Used ressources
- http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
Solution 2:
This is for Excel.
Consider the following User Defined Function (UDF):
Public Function GetEmailAddy(Sin As String) As String
Dim s As String
If InStr(1, Sin, "@") = 0 Then
GetEmailAddy = ""
Exit Function
End If
s = Replace(Sin, Chr(10), " ")
s = Replace(s, Chr(13), " ")
s = Application.WorksheetFunction.Trim(s)
ary = Split(s, " ")
For Each a In ary
If InStr(1, a, "@") > 0 Then
GetEmailAddy = a
Exit Function
End If
Next a
End Function
User Defined Functions (UDFs) are very easy to install and use:
- ALT-F11 brings up the VBE window
- ALT-I ALT-M opens a fresh module
- paste the stuff in and close the VBE window
If you save the workbook, the UDF will be saved with it. If you are using a version of Excel later then 2003, you must save the file as .xlsm rather than .xlsx
To remove the UDF:
- bring up the VBE window as above
- clear the code out
- close the VBE window
To use the UDF from Excel:
=GetEmailAddy(A1)
To learn more about macros in general, see:
http://www.mvps.org/dmcritchie/excel/getstarted.htm
and
http://msdn.microsoft.com/en-us/library/ee814735(v=office.14).aspx
and for specifics on UDFs, see:
http://www.cpearson.com/excel/WritingFunctionsInVBA.aspx
Macros must be enabled for this to work!
Here are some examples: